
Cluster troubleshooting

Typical Troubleshooting Mindset and Workflow

Troubleshooting an OpenShift cluster is less about memorizing every command and more about having a systematic approach. A practical workflow:

  1. Define the problem clearly
    • What is broken? (app, route, pod scheduling, cluster API, etc.)
    • Since when? After which change (upgrade, config, deployment)?
    • Is it localized (single project/node) or widespread (cluster-wide)?
  2. Check blast radius
    • One user or all users?
    • One namespace or all namespaces?
    • One node or multiple nodes / zones?
  3. Confirm symptoms with basic health checks
    • Can you oc login?
    • Is the API responsive?
    • Are core namespaces (e.g. openshift-apiserver, openshift-operator-lifecycle-manager) healthy?
  4. Narrow down the layer
    • Application layer (pods, Deployments, Routes)
    • Namespace / quota / RBAC issues
    • Cluster services (DNS, image registry, SDN, Ingress)
    • Control plane (API server, controllers, etcd)
    • Infrastructure (nodes, network, storage)
  5. Form a hypothesis, test, adjust
    • Make only one change at a time.
    • Log what you did and why (for handover and future reference).
  6. Know when to escalate
    • Gather diagnostics (logs, events, sosreport, must-gather) before escalating to SRE or vendor.

The rest of this chapter focuses on practical tools and patterns to support this workflow.

Core Tools for Troubleshooting

Basic `oc` diagnostics

Common starting commands:

Examples:

# Overall cluster condition
oc status
# Check all cluster operators
oc get clusteroperators
# Describe one operator in detail
oc describe clusteroperator authentication
# Check pod state across all namespaces
oc get pods -A
# Verify can you reach the API smoothly
oc get nodes

Using `oc describe` effectively

oc describe is often more useful than oc get when troubleshooting, because it combines an object's spec, status, and recent events in a single view.

Example:

oc describe pod my-app-76cd49fdc6-8q4kt

Focus on the Events section at the bottom of the output, the container states and last termination reasons, and the configured resource requests, limits, volumes, and node assignment.
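
If you want the events on their own rather than buried in the describe output, something along these lines works (namespace and pod name are placeholders):

# Recent events in the namespace, newest last
oc get events -n my-namespace --sort-by=.lastTimestamp
# Events that reference one specific pod
oc get events -n my-namespace --field-selector involvedObject.name=my-app-76cd49fdc6-8q4kt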

`oc debug` for deep inspection

oc debug lets you start a copy of a failing pod with an interactive shell, run a privileged debug pod on a node and inspect the host filesystem under /host, and try out images or entrypoints without touching the original workload.

Examples:

# Debug a faulty pod with its filesystem mounted
oc debug pod/my-app-76cd49fdc6-8q4kt
# Start a debug pod on a specific node
oc debug node/ip-10-0-1-23.ec2.internal
# In the debug session, you can inspect host paths like /host/etc, /host/var

Use oc debug node/<node-name> to inspect node-level issues without modifying the node’s configuration or relying on direct SSH access.
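
A minimal sketch of typical read-only checks inside a node debug session, assuming an RHCOS node with CRI-O (the node name is a placeholder):

oc debug node/ip-10-0-1-23.ec2.internal
# Inside the debug pod, switch into the host filesystem
chroot /host
# Kubelet and container runtime health
systemctl status kubelet
journalctl -u kubelet --since "30 min ago" | tail -n 50
crictl ps
# Disk pressure is a frequent cause of node trouble
df -h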

Collecting Logs

Application pod logs

Basic usage:

# Last logs from the main container
oc logs my-app-76cd49fdc6-8q4kt
# Logs from a named container in a pod
oc logs my-app-76cd49fdc6-8q4kt -c sidecar
# Stream logs
oc logs -f my-app-76cd49fdc6-8q4kt
# Previous container instance (useful in CrashLoopBackOff)
oc logs my-app-76cd49fdc6-8q4kt -p
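
A few log variants that often save time; the Deployment name and the app=my-app label are assumptions for illustration:

# Logs from one pod of a Deployment (oc picks a pod for you)
oc logs deployment/my-app --all-containers
# Logs from every pod matching a label, last 50 lines each
oc logs -l app=my-app --tail=50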

OpenShift component logs

Most OpenShift platform components run as pods in system namespaces.

Examples:

# List pods in a system namespace
oc get pods -n openshift-apiserver
# Logs for a specific OpenShift component pod
oc logs -n openshift-apiserver apiserver-6f488978d6-f7nrn

Common namespaces to inspect include openshift-apiserver, openshift-kube-apiserver, openshift-etcd, openshift-ingress, openshift-dns, openshift-image-registry, and openshift-monitoring.

Focus on Warning or Error lines and repeated failures.
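To surface platform-level problems without opening every log, filtering events and grepping a component log can help (the pod name is the same placeholder used above):

# Warning events across all namespaces, newest last
oc get events -A --field-selector type=Warning --sort-by=.lastTimestamp
# Recent error lines from one component pod
oc logs -n openshift-apiserver apiserver-6f488978d6-f7nrn | grep -iE "error|fail" | tail -n 50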

Using `oc adm` and `must-gather`

For more systematic cluster diagnostics, especially before escalation or when there are widespread issues, OpenShift provides admin-level tools.

`oc adm top` and quick resource checks

Examples:

oc adm top nodes
oc adm top pods -A

Useful to identify nodes running close to their CPU or memory capacity, pods consuming unexpectedly high resources, and imbalanced scheduling across nodes.
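
To relate live usage to what workloads actually requested, a sketch along these lines can help (namespace and node name are placeholders):

# Per-container usage in one namespace
oc adm top pods -n my-namespace --containers
# Requested vs. allocatable resources on one node
oc describe node ip-10-0-1-23.ec2.internal | grep -A 10 "Allocated resources"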

`oc adm must-gather`

must-gather collects a comprehensive snapshot of cluster state and logs.

Basic usage with default image:

oc adm must-gather

This will run a temporary collection pod with wide read access, gather logs, resource definitions, and operator status from the core OpenShift namespaces, and write the result to a local must-gather.local.* directory.

You can specify a custom image (for vendor support or specialized collections) with --image. The output is a directory you can compress and share with support or keep for offline analysis.
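
A hedged example with an explicit output directory and an optional custom collection image (registry path and directory names are placeholders):

# Write the collection to a known directory
oc adm must-gather --dest-dir=./must-gather-incident-42
# Use a vendor- or component-specific collection image
oc adm must-gather --image=registry.example.com/support/must-gather:latest --dest-dir=./must-gather-custom
# Compress before sharing with support
tar czf must-gather-incident-42.tar.gz must-gather-incident-42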

Best practices: run it as soon as possible after an incident (before logs rotate), record the time window and observed symptoms alongside the archive, and share the output unmodified with support.

Identifying and Working with Common Failure Patterns

Pods stuck in Pending

Typical root causes: insufficient CPU or memory on schedulable nodes, node selectors or affinity rules that no node satisfies, taints without matching tolerations, unbound PersistentVolumeClaims, or an exhausted namespace quota.

Diagnostics:

oc get pods
oc describe pod <pod-name>

Check the Events section in oc describe, in particular FailedScheduling messages, which state why no node was suitable (insufficient resources, taints, affinity, or volume constraints).

Remediation examples: reduce resource requests or add node capacity, fix node selectors and tolerations, provision or bind the missing PVC, or raise or free the namespace quota.
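
A few commands that often pinpoint the scheduling blocker, run in the affected project:

# Scheduling failures, newest last
oc get events --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
# Is a required PVC still Pending?
oc get pvc
# Is the namespace quota exhausted? (quota is a shortname for resourcequota)
oc describe quota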

Pods in CrashLoopBackOff

Common causes: the application exits at startup due to bad configuration or missing environment variables/Secrets, failing liveness probes, file or volume permission problems, and containers killed for exceeding their memory limit (OOMKilled).

Diagnostics:

  1. Check pod status and recent restarts:
   oc get pod <pod-name>
   oc describe pod <pod-name>
  2. Look at the previous container's logs:
   oc logs <pod-name> -p
  3. If needed, override the container’s entrypoint or start a debuggable copy of the Deployment (for example with oc debug deployment/<name>) to get an interactive shell.

Remediation: fix the configuration or supply the missing Secret/ConfigMap, tune probe thresholds or startup timing, and raise memory limits if the container is being OOMKilled.
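
A sketch for finding the crash reason and getting a shell, assuming a Deployment named my-app and the pod name used earlier:

# Why did the previous container instance exit? (OOMKilled, Error, ...)
oc get pod my-app-76cd49fdc6-8q4kt -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
# Start a copy of the Deployment with an interactive shell instead of the normal entrypoint
oc debug deployment/my-app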

Image pull failures

Symptoms: pods stuck in ErrImagePull or ImagePullBackOff, and Deployments that never reach their desired replica count.

Diagnostics with:

oc describe pod <pod-name>

Look for events such as "Failed to pull image", "manifest unknown", or "unauthorized: authentication required".

Typical causes: a wrong image name or tag, a missing or invalid pull secret, a registry that is unreachable from the nodes, or rate limiting on public registries.

Checks:

  oc get secrets
  oc describe secret <pull-secret-name>
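
If the cause is a missing pull secret, a typical fix looks roughly like this; registry, secret name, and credentials are placeholders:

# Create a pull secret for a private registry
oc create secret docker-registry my-pull-secret --docker-server=registry.example.com --docker-username=myuser --docker-password=REDACTED
# Let the default service account use it for image pulls
oc secrets link default my-pull-secret --for=pull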

Networking and connectivity issues

High-level networking problems can manifest as Services that resolve but do not respond, Routes returning 503 errors, DNS failures inside pods, or intermittent timeouts between pods, namespaces, or nodes.

Checking Service and endpoint health

For a given Service:

oc get svc my-service
oc describe svc my-service
oc get endpoints my-service

Confirm that the Service selector matches the pod labels, that targetPort matches the container port, and that the endpoints list contains the IPs of Ready pods.

If endpoints are empty, the selector matches no pods, or the matching pods are not Ready (for example because their readiness probes are failing).
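
To compare the Service selector against actual pod labels, assuming the label app=my-app purely for illustration:

# Which labels does the Service select on?
oc get svc my-service -o jsonpath='{.spec.selector}{"\n"}'
# Do any Running, Ready pods carry those labels?
oc get pods -l app=my-app -o wide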

Testing connectivity from inside the cluster

Use a temporary debug pod:

oc run -it net-debug --image=registry.access.redhat.com/ubi9/ubi-minimal -- sh

Then test DNS resolution of the Service name and TCP/HTTP connectivity to the Service port (for example with curl).

If DNS fails, inspect DNS-related components (e.g. openshift-dns pods).
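
A sketch for checking cluster DNS health and name resolution; service name, namespace, and port are placeholders, and you may need to install curl with microdnf if the image lacks it:

# Are the DNS operator and the cluster DNS pods healthy?
oc get clusteroperator dns
oc get pods -n openshift-dns
# From the net-debug pod, resolve and reach a Service by its cluster DNS name
curl -sv http://my-service.my-namespace.svc.cluster.local:8080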

Route and Ingress issues

For a Route:

oc get route my-route
oc describe route my-route

Focus on the host name, the target Service and port, the TLS termination type, and the status conditions showing whether the Route was admitted by an Ingress Controller.

If the Route is not admitted, check Ingress Controller pods in openshift-ingress namespace.

From outside the cluster, verify that DNS resolves the Route host to the router’s load balancer, that the expected certificate is served, and which HTTP status code is returned.

Use curl -v from a client to see HTTP status codes and TLS errors.
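
A sketch of typical Route checks, assuming the default Ingress Controller deployment name router-default and a placeholder Route host:

# Ingress Controller (router) pods and recent logs
oc get pods -n openshift-ingress
oc logs -n openshift-ingress deployment/router-default | tail -n 50
# From outside the cluster: DNS, TLS certificate, and HTTP status in one view
curl -kv https://my-route-host.apps.example.com/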

Storage-related problems

Common observed states:

Diagnosing PVC and PV issues

Check PVC:

oc get pvc
oc describe pvc <pvc-name>

Look for the PVC phase (Pending vs. Bound), the requested storage class, size, and access mode, and events describing provisioning failures.

Check corresponding PV (if statically provisioned or already created):

oc get pv
oc describe pv <pv-name>

Possible issues: a missing or misconfigured StorageClass or provisioner, a capacity or access-mode mismatch between PVC and PV, or a PV left in Released state that was never reclaimed.

If pods are failing with mount errors, inspect node logs via oc debug node and check the kubelet and CSI driver messages, and whether the storage backend is reachable from that node.
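
To see the kubelet’s view of the mount failure, a node debug session like the following can help (node name is a placeholder):

oc debug node/ip-10-0-1-23.ec2.internal
chroot /host
journalctl -u kubelet --since "1 hour ago" | grep -iE "mount|attach" | tail -n 50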

Control Plane and Cluster Operator Troubleshooting

Using ClusterOperators as health indicators

Cluster Operators express the health and upgrade status of major OpenShift components.

Basic inspection:

oc get clusteroperators

Columns to interpret: AVAILABLE, PROGRESSING, and DEGRADED, together with the reported VERSION and how long the current state has persisted (SINCE).

Example pattern: an operator that reports AVAILABLE=False or DEGRADED=True for a sustained period is where to start; PROGRESSING=True is normal during upgrades but should not persist indefinitely.

Deep-dive into a specific operator:

oc describe clusteroperator ingress

Look at the Conditions and their messages, which usually explain why the operator is degraded, and at the Related Objects, which point to the namespaces and resources to inspect next.
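
To print an operator’s conditions with their messages in one go (ingress is just an example):

oc get clusteroperator ingress -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.message}{"\n"}{end}'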

Remediation often involves fixing the underlying resource the operator complains about (a certificate, a misconfiguration, broken pods in its namespace) and letting the operator reconcile, rather than modifying the operator itself.

API server and etcd issues

Symptoms: slow or failing oc commands, API timeouts in the web console, and degraded kube-apiserver or etcd ClusterOperators.

Investigations:

  oc get pods -n openshift-kube-apiserver
  oc logs -n openshift-kube-apiserver <pod-name>
  oc get pods -n openshift-etcd
  oc logs -n openshift-etcd <pod-name>

Watch for repeated leader elections, etcd warnings about slow requests and disk latency ("apply request took too long"), certificate errors, and frequent restarts of the API server pods.

These issues are sensitive; for major problems, collect must-gather and follow vendor-specific runbooks rather than making manual changes to etcd.

Node-Level Troubleshooting

Problems at the node level can impact scheduling, pod performance, and overall availability.

Detecting node issues

Quick overview:

oc get nodes

Node Status may show Ready, NotReady, or SchedulingDisabled (cordoned nodes are displayed as Ready,SchedulingDisabled).

For a specific node:

oc describe node <node-name>

Check the node Conditions (Ready, MemoryPressure, DiskPressure, PIDPressure), allocatable versus allocated resources, taints, and recent events.

If a node is NotReady, check whether the kubelet is running and can reach the API server, whether the node has network connectivity, and whether disk or memory pressure is being reported.

Inspecting a node with `oc debug`

Instead of direct SSH, use:

oc debug node/<node-name>

Inside the debug session, chroot /host to use the host’s tools, then inspect the kubelet and container runtime (systemctl status kubelet, journalctl -u kubelet, crictl ps), disk usage, and network configuration.

Use oc debug to avoid manual host modifications and to stay within supported platform practices.

Draining and cordoning nodes

When a node is unhealthy or needs maintenance, coordinate pod movement:

  oc adm cordon <node-name>
  oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
  oc adm uncordon <node-name>

Draining is covered more thoroughly in node maintenance; from a troubleshooting perspective, use it to isolate problematic nodes and verify whether issues follow a specific node or not.
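
To check whether workloads or symptoms follow the node, for example (node name is a placeholder):

# Which pods are (still) scheduled on the node?
oc get pods -A -o wide --field-selector spec.nodeName=ip-10-0-1-23.ec2.internal
# Confirm the node shows SchedulingDisabled after cordoning
oc get node ip-10-0-1-23.ec2.internal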

Troubleshooting the OpenShift Web Console

Typical symptoms: the console does not load, login fails or loops, or pages load but show errors when calling the API.

Key components: the console pods and Route in openshift-console, the console operator in openshift-console-operator, and the authentication (OAuth) stack the console depends on.

Checks:

oc get pods -n openshift-console
oc logs -n openshift-console <console-pod>
oc get routes -n openshift-console
oc describe route console -n openshift-console

Verify that the console pods are Running, that the console Route is admitted and reachable externally, and that its TLS certificate is valid.

If the console cannot reach the API (but oc works), look for proxy or network policy restrictions between the console pods and the API, certificate problems, and OAuth/authentication errors in the console and console operator logs.
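
If the console pods themselves look healthy, the console operator is the next place to check; the names below assume a default installation:

oc get clusteroperator console
oc logs -n openshift-console-operator deployment/console-operator | tail -n 50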

Using Monitoring and Logging for Troubleshooting

The monitoring and logging stack is covered elsewhere in detail; here the focus is on how to use it during incident response.

Quick checks with built-in monitoring

Within the Admin console (if accessible), check Observe → Alerting for firing alerts and Observe → Dashboards or Metrics for resource usage of the affected nodes and workloads.

From the CLI, you can check the monitoring stack itself (pods in openshift-monitoring) and query metrics or firing alerts over the exposed monitoring Routes.
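
One way to query firing alerts from the CLI is via the Thanos Querier Route with your user token; a sketch, assuming your account may view metrics and the default route name:

TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
# All currently firing alerts, via the ALERTS metric
curl -skG -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" --data-urlencode 'query=ALERTS{alertstate="firing"}'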

Logging considerations

Depending on how cluster logging is configured (e.g. via the OpenShift Logging Operator), you might have a central log store (Loki- or Elasticsearch-based) with a console plugin or Kibana, and/or log forwarding to an external system.

During troubleshooting, filter the central logs by namespace, pod, and container, and correlate log timestamps with events, alerts, and the change timeline.

If centralized logging is not available, rely on oc logs and node-level inspection via oc debug.

Systematic Root Cause Analysis and Documentation

Troubleshooting doesn’t end when the symptom disappears. For maintainable operations:

  1. Capture timeline
    • When did it start?
    • What changed right before? (upgrade, new operator, config change)
  2. Document evidence
    • Commands used, key outputs.
    • Screenshots or exported logs as needed.
    • Alerts that were firing and their resolution.
  3. Identify root cause
    • Misconfiguration?
    • Capacity limit?
    • Platform bug?
    • External system failure (storage, DNS, load balancer)?
  4. Define preventive actions
    • Hardening configuration (quotas, limits, health checks).
    • Adding or tuning alerts.
    • Automating checks in CI/CD or cluster policies.
  5. Share knowledge
    • Internal runbooks or knowledge base entries.
    • Post-incident review with your team.

Building this habit improves overall cluster reliability and shortens future troubleshooting sessions.

When and How to Escalate

Sometimes, issues require vendor support or a specialized SRE/operations team.

Before escalating, collect a fresh must-gather, record cluster and operator versions, summarize the timeline, blast radius, and symptoms, and note what you have already tried and the current state.

This structured context helps others quickly understand the situation and significantly reduces time to resolution.

By combining these tools, patterns, and practices, you build a repeatable approach to diagnosing and resolving most OpenShift cluster issues encountered in day-to-day operations.
