Typical Troubleshooting Mindset and Workflow
Troubleshooting an OpenShift cluster is less about memorizing every command and more about having a systematic approach. A practical workflow:
- Define the problem clearly
  - What is broken? (app, route, pod scheduling, cluster API, etc.)
  - Since when? After which change (upgrade, config, deployment)?
  - Is it localized (single project/node) or widespread (cluster-wide)?
- Check blast radius
  - One user or all users?
  - One namespace or all namespaces?
  - One node or multiple nodes / zones?
- Confirm symptoms with basic health checks
  - Can you `oc login`? Is the API responsive?
  - Are core namespaces (e.g. `openshift-apiserver`, `openshift-operator-lifecycle-manager`) healthy?
- Narrow down the layer
  - Application layer (pods, Deployments, Routes)
  - Namespace / quota / RBAC issues
  - Cluster services (DNS, image registry, SDN, Ingress)
  - Control plane (API server, controllers, etcd)
  - Infrastructure (nodes, network, storage)
- Form a hypothesis, test, adjust
  - Make only one change at a time.
  - Log what you did and why (for handover and future reference).
- Know when to escalate
  - Gather diagnostics (logs, events, sosreport, `must-gather`) before escalating to SRE or vendor.
The rest of this chapter focuses on practical tools and patterns to support this workflow.
Core Tools for Troubleshooting
Basic `oc` diagnostics
Common starting commands:
- Cluster-level view: `oc status`, `oc whoami`, `oc get clusterversion`, `oc get nodes`, `oc get clusteroperators`
- Namespaced resources: `oc get pods`, `oc get pods -A`, `oc get events`, `oc describe <resource> <name>`
- Debugging pods and nodes: `oc logs`, `oc rsh`, `oc exec`, `oc debug`
Examples:
# Overall cluster condition
oc status
# Check all cluster operators
oc get clusteroperators
# Describe one operator in detail
oc describe clusteroperator authentication
# Check pod state across all namespaces
oc get pods -A
# Verify that you can reach the API
oc get nodes
Using `oc describe` effectively
`oc describe` is often more useful than `oc get` when troubleshooting:
- Shows events related to the resource (e.g. failed pulls, scheduling issues).
- Shows finalizers, conditions, and last transition times.
- For pods, lists container states (waiting, running, terminated) and reasons.
Example:
oc describe pod my-app-76cd49fdc6-8q4kt
Focus on:
- `Status` → reasons like `ImagePullBackOff`, `CrashLoopBackOff`, `CreateContainerError`
- `Events` → time-ordered warnings and errors
- `Conditions` → `Ready`, `PodScheduled`, etc.
`oc debug` for deep inspection
oc debug lets you:
- Start a debug pod on a specific node
- Get a transient shell inside a pod’s environment
- Run host-level tools without logging into the node via SSH (if SSH is restricted)
Examples:
# Debug a faulty pod with its filesystem mounted
oc debug pod/my-app-76cd49fdc6-8q4kt
# Start a debug pod on a specific node
oc debug node/ip-10-0-1-23.ec2.internal
# In the debug session, you can inspect host paths like /host/etc, /host/var
Use oc debug node to inspect node-level issues without breaking the node’s configuration.
Collecting Logs
Application pod logs
Basic usage:
# Last logs from the main container
oc logs my-app-76cd49fdc6-8q4kt
# Logs from a named container in a pod
oc logs my-app-76cd49fdc6-8q4kt -c sidecar
# Stream logs
oc logs -f my-app-76cd49fdc6-8q4kt
# Previous container instance (useful in CrashLoopBackOff)
oc logs my-app-76cd49fdc6-8q4kt -p
OpenShift component logs
Most OpenShift platform components run as pods in system namespaces.
Examples:
# List pods in a system namespace
oc get pods -n openshift-apiserver
# Logs for a specific OpenShift component pod
oc logs -n openshift-apiserver apiserver-6f488978d6-f7nrn
Common namespaces to inspect:
- `openshift-apiserver`
- `openshift-controller-manager`
- `openshift-kube-apiserver`
- `openshift-kube-controller-manager`
- `openshift-kube-scheduler`
- `openshift-etcd`
- `openshift-ingress`
- `openshift-sdn` or `openshift-ovn-kubernetes` (depending on the SDN)
- `openshift-image-registry`
- `openshift-authentication`
Focus on Warning or Error lines and repeated failures.
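A quick way to apply that focus is to filter a component pod's logs for warning and error lines; a sketch only, and the pod name is a placeholder:
# Scan a component pod's logs for warnings and errors (pod name is an example)
oc logs -n openshift-apiserver apiserver-6f488978d6-f7nrn | grep -iE "error|warn"
# Tail recent lines live while reproducing the issue
oc logs -n openshift-apiserver apiserver-6f488978d6-f7nrn --tail=200 -f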
Using `oc adm` and `must-gather`
For more systematic cluster diagnostics, especially before escalation or when there are widespread issues, OpenShift provides admin-level tools.
`oc adm top` and quick resource checks
- `oc adm top nodes` – CPU/memory usage per node
- `oc adm top pods` – CPU/memory usage per pod (optionally namespaced)
Examples:
oc adm top nodes
oc adm top pods -A
Useful to identify:
- Nodes under memory pressure
- Pods consuming excessive resources
`oc adm must-gather`
must-gather collects a comprehensive snapshot of cluster state and logs.
Basic usage with default image:
oc adm must-gather
This will:
- Run collector pods
- Gather:
  - API objects
  - Cluster operator states
  - Node details
  - Logs from platform components
You can specify a custom image (for vendor support or specialized collections) with --image. The output is a directory you can compress and share with support or keep for offline analysis.
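For example, a common pattern is to collect into a timestamped directory and archive it for upload; a minimal sketch, assuming a shell with command substitution (the custom image reference is a placeholder):
# Collect into a timestamped directory
DIR=must-gather-$(date +%Y%m%d-%H%M)
oc adm must-gather --dest-dir="./${DIR}"
# Optional: use a vendor-provided collection image (placeholder reference)
# oc adm must-gather --image=registry.example.com/support/must-gather:latest --dest-dir="./${DIR}"
# Compress the directory for support upload
tar czf "${DIR}.tar.gz" "${DIR}"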
Best practices:
- Run as a user with cluster-admin privileges.
- Include timestamp in directory or archive name.
- Run before making major changes that may alter cluster state.
Identifying and Working with Common Failure Patterns
Pods stuck in Pending
Typical root causes:
- No matching node (taints, labels, node selectors, affinities)
- Insufficient cluster resources (CPU, memory)
- Unsatisfied PVC (storage not available)
- Network policy or CNI issues (less common for scheduling)
Diagnostics:
oc get pods
oc describe pod <pod-name>
Check in oc describe:
- `PodScheduled` condition: messages like `0/5 nodes are available: 1 node(s) had taint {...}, 4 Insufficient memory`, etc.
- Events: errors like `0/3 nodes are available: 3 node(s) didn't match node selector`.
Remediation examples:
- Adjust `resources.requests` so pods can fit on available nodes (see the sketch after this list).
- Ensure node labels match configuration (`nodeSelector`, `affinity`).
- Remove or adjust `tolerations` or taints as intended.
- Check for missing PVCs (see storage troubleshooting later in this chapter).
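As an example of the first point, requests can be lowered in place with `oc set resources`; a sketch only, with a placeholder deployment name and values you would size for your workload:
# Lower CPU/memory requests on a deployment so the scheduler can place its pods
oc set resources deployment/my-app --requests=cpu=100m,memory=256Mi
# Re-check scheduling
oc get pods -o wide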
Pods in CrashLoopBackOff
Common causes:
- Application bug or misconfiguration
- Missing environment variables or Secrets
- Startup commands failing (`command`/`args` misconfigured)
- Permissions in the container or volume
Diagnostics:
- Check pod status and recent restarts:
oc get pod <pod-name>
oc describe pod <pod-name>
- Look at previous container logs:
oc logs <pod-name> -p
- If needed, modify the container’s entrypoint or create a copy of the Deployment with a longer sleep to get an interactive shell (see the sketch below).
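One low-risk way to get that shell is to let `oc debug` start a copy of the workload's pod with its command overridden; a sketch, assuming a Deployment named my-app:
# Start a copy of the deployment's pod with an interactive shell (the original pod is untouched)
oc debug deployment/my-app
# Or run a one-off command in the debug copy (path is a placeholder)
oc debug deployment/my-app -- ls /etc/my-app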
Remediation:
- Fix configuration via `ConfigMap` or `Secret`.
- Adjust resource limits if the container was OOMKilled (check the container termination reason, as shown below).
- Ensure mounted volumes are accessible (permissions, path existence).
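To confirm an OOM kill, the last termination reason can be read straight from the pod status; a sketch with a placeholder pod name, assuming a single container:
# Show the last termination reason of the first container (e.g. OOMKilled)
oc get pod my-app-76cd49fdc6-8q4kt -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'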
Image pull failures
Symptoms:
- Pod in `ImagePullBackOff` or `ErrImagePull`
Diagnostics with:
oc describe pod <pod-name>
Look for events such as:
- `Failed to pull image "registry.example.com/app:tag": rpc error: ...`
- `unauthorized: authentication required`
- `manifest unknown`
Typical causes:
- Wrong image name or tag
- Private registry credentials missing or invalid
- Network issues reaching the registry
- Image stream misconfiguration (if using OpenShift image streams)
Checks:
- Validate image reference in the Deployment/DeploymentConfig.
- Verify pull secret in the namespace:
oc get secrets
oc describe secret <pull-secret-name>
- Ensure the ServiceAccount is configured to use the correct imagePullSecret (see the sketch after this list).
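If the secret exists but is not used for pulls, linking it to the pulling ServiceAccount is a common fix; a sketch, with a placeholder secret name and the default ServiceAccount assumed:
# Link an existing pull secret to the default ServiceAccount for image pulls
oc secrets link default my-pull-secret --for=pull
# Verify the ServiceAccount now lists the secret under image pull secrets
oc describe sa default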
Networking and connectivity issues
High-level networking problems can manifest as:
- Pods cannot reach external services.
- Services unresolved by name (DNS).
- Routes unreachable from outside.
- Health checks (readiness/liveness) failing.
Checking Service and endpoint health
For a given Service:
oc get svc my-service
oc describe svc my-service
oc get endpoints my-service
Confirm:
- Endpoints exist and reference Ready pods.
- Ports match container ports.
If endpoints are empty:
- Backtrack to pod labels: ensure they match the Service selector (a quick comparison follows below).
- Validate pod readiness probes.
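Comparing the Service selector against actual pod labels often reveals the mismatch; the service name and label values below are placeholders:
# Show the Service's selector
oc get svc my-service -o jsonpath='{.spec.selector}{"\n"}'
# List pods with their labels to spot a mismatch
oc get pods --show-labels
# Or filter by the label the Service expects
oc get pods -l app=my-app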
Testing connectivity from inside the cluster
Use a temporary debug pod:
oc run -it net-debug --image=registry.access.redhat.com/ubi9/ubi-minimal -- sh
Then test:
- DNS: `getent hosts my-service.my-namespace.svc.cluster.local`
- HTTP: `curl http://my-service:8080/healthz`
- External reachability: `curl https://example.com`
If DNS fails, inspect DNS-related components (e.g. openshift-dns pods).
Route and Ingress issues
For a Route:
oc get route my-route
oc describe route my-route
Focus on:
- Hostname
- TLS configuration
- Admitted status (conditions)
- Target Service
If the Route is not admitted, check the Ingress Controller pods in the `openshift-ingress` namespace.
From outside the cluster, verify:
- DNS record for the route host resolves to the ingress IP or load balancer.
- Load balancer health checks are passing (if applicable).
Use curl -v from a client to see HTTP status codes and TLS errors.
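For instance, if DNS for the route host is in doubt, curl's --resolve option can pin the host to a known ingress IP; the hostname and IP below are placeholders:
# Inspect HTTP status codes and TLS handshake details
curl -v https://my-app.apps.example.com/
# Bypass DNS and send the request to a specific ingress / load balancer IP
curl -v --resolve my-app.apps.example.com:443:203.0.113.10 https://my-app.apps.example.com/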
Storage-related problems
Common observed states:
- Pods stuck in Pending due to unsatisfied PVC.
- Mounted volumes are read-only.
- I/O errors or timeouts in application logs.
Diagnosing PVC and PV issues
Check PVC:
oc get pvc
oc describe pvc <pvc-name>
Look for:
- `Status`: `Pending`, `Bound`
- Events: provisioning failures, access mode mismatches
- `StorageClass`: is the correct class referenced?
Check corresponding PV (if statically provisioned or already created):
oc get pv
oc describe pv <pv-name>
Possible issues:
- No StorageClass or wrong StorageClass configured.
- Requested size or accessMode not supported by underlying storage.
- Storage provider not reachable or misconfigured at the infrastructure level.
If pods are failing with mount errors, inspect node logs via oc debug node and check:
- CSI driver pods (in `openshift-cluster-csi-drivers` or a provider-specific namespace).
- Kubelet logs for mount failures.
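A read-only sketch of both checks; the node name is a placeholder, and the CSI namespace may differ for third-party drivers:
# Check the platform CSI driver pods
oc get pods -n openshift-cluster-csi-drivers
# Search kubelet logs on the affected node for mount-related errors
oc debug node/ip-10-0-1-23.ec2.internal -- chroot /host journalctl -u kubelet --no-pager | grep -i mount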
Control Plane and Cluster Operator Troubleshooting
Using ClusterOperators as health indicators
Cluster Operators express the health and upgrade status of major OpenShift components.
Basic inspection:
oc get clusteroperators
Columns to interpret:
- `AVAILABLE`
- `PROGRESSING`
- `DEGRADED`
Example pattern:
- `DEGRADED=True` – component encountering persistent errors.
- `PROGRESSING=True` – upgrading or reconciling; prolonged `True` could indicate a stuck upgrade.
- `AVAILABLE=False` – component not serving requests, potentially blocking cluster operations.
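To scan these conditions across all operators from the CLI, a jsonpath query can print one line per operator; a sketch only, with minimal formatting:
# Print each cluster operator with its Available and Degraded condition status
oc get clusteroperators -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Available")].status}{"\t"}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}'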
Deep-dive into a specific operator:
oc describe clusteroperator ingress
Look at:
- `Conditions` → reasons and messages
- Related objects → namespaces, deployments
- Last transition time, frequency of changes
Remediation often involves:
- Examining pods in the operator’s namespace.
- Fixing underlying resources (e.g. a broken DaemonSet, misconfigured ConfigMap).
- Ensuring nodes meet requirements for that component.
API server and etcd issues
Symptoms:
- `oc` commands time out or intermittently fail.
- Console cannot connect to the API.
- Cluster operators related to `kube-apiserver` or `etcd` are in a degraded state.
Investigations:
- Check APIServer pods:
oc get pods -n openshift-kube-apiserver
oc logs -n openshift-kube-apiserver <pod-name>
- Check etcd pods:
oc get pods -n openshift-etcd
oc logs -n openshift-etcd <pod-name>
Watch for:
- Quorum issues
- Disk space or latency problems
- Certificate errors
These issues are sensitive; for major problems, collect must-gather and follow vendor-specific runbooks rather than making manual changes to etcd.
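As a read-only first look that changes nothing in etcd, the etcd container logs can be scanned for latency warnings; the pod name below is a placeholder:
# Slow-request warnings often point to disk latency problems
oc logs -n openshift-etcd etcd-master-0 -c etcd | grep -i "took too long"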
Node-Level Troubleshooting
Cluster nodes can impact scheduling, pod performance, and overall availability.
Detecting node issues
Quick overview:
oc get nodes
Node Status may show:
- `Ready`
- `NotReady`
- `SchedulingDisabled`
For a specific node:
oc describe node <node-name>
Check:
- Conditions: `Ready`, `MemoryPressure`, `DiskPressure`, `PIDPressure`, `NetworkUnavailable`.
- Allocatable vs requested resources.
- Events: Kubelet issues, network plugin failures.
If a node is NotReady:
- Pods may be evicted or stuck.
- New pods will not be scheduled there.
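To see which workloads are affected, the pods scheduled on that node can be listed directly; the node name is a placeholder:
# List all pods on a specific node, across namespaces
oc get pods -A --field-selector spec.nodeName=ip-10-0-1-23.ec2.internal -o wide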
Inspecting a node with `oc debug`
Instead of direct SSH, use:
oc debug node/<node-name>
Inside the debug session:
- Host filesystem is typically mounted under `/host`.
- Basic checks:
  - Disk space: `chroot /host df -h`
  - System logs (depending on OS config): `chroot /host journalctl -u kubelet`
  - Network interfaces: `chroot /host ip a`
Use oc debug to avoid manual host modifications and respect platform practices.
Draining and cordoning nodes
When a node is unhealthy or needs maintenance, coordinate pod movement:
- Cordon – mark node unschedulable (no new pods):
oc adm cordon <node-name>
- Drain – evict pods and prepare the node for maintenance:
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
- Uncordon – return the node to service:
oc adm uncordon <node-name>
Draining is covered more thoroughly in node maintenance; from a troubleshooting perspective, use it to isolate problematic nodes and verify whether issues follow a specific node.
Troubleshooting the OpenShift Web Console
Typical symptoms:
- Console does not load or returns errors.
- Certain pages fail or time out.
- User can use `oc` but not the web console.
Key components:
- Console deployment in the `openshift-console` namespace.
- Console route (TLS, hostname).
- OAuth / authentication integration.
Checks:
oc get pods -n openshift-console
oc logs -n openshift-console <console-pod>
oc get routes -n openshift-console
oc describe route console -n openshift-console
Verify:
- Route host resolves to correct IP / load balancer.
- Console pods are `Ready`.
- No repeated authentication or backend errors in logs.
If the console cannot reach the API (but `oc` works), look for:
- Network restrictions between console pods and API endpoint.
- Misconfigurations in the console config (ConfigMaps in `openshift-console`).
Using Monitoring and Logging for Troubleshooting
The monitoring and logging stack is covered elsewhere in detail; here the focus is on how to use it during incident response.
Quick checks with built-in monitoring
Within the Admin console (if accessible):
- Check Cluster Status tiles (alerts, operator health).
- Check Monitoring → Alerts:
  - Look for firing alerts (e.g. `KubeNodeNotReady`, `KubeAPIDown`, `KubePersistentVolumeFillingUp`).
  - Use alert annotations as hints for root cause and suggested actions.
From CLI, you can:
- Inspect the Prometheus route and see if it’s up (for advanced users).
- Use `oc adm top` as described earlier for resource usage.
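A quick read-only check that the monitoring stack itself is reachable (a sketch; route names vary by version):
# List monitoring routes (e.g. Alertmanager, Thanos Querier)
oc get routes -n openshift-monitoring
# Confirm the monitoring pods are running
oc get pods -n openshift-monitoring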
Logging considerations
Depending on how cluster logging is configured (e.g. via the OpenShift Logging Operator), you might have:
- Application logs aggregated centrally.
- Infrastructure and audit logs stored in an external system.
During troubleshooting:
- Correlate pod events with logs by timestamp.
- Use labels (namespace, app) to filter application logs.
- Check infrastructure logs (kubelet, SDN, storage) for repeated patterns.
If centralized logging is not available, rely on oc logs and node-level inspection via oc debug.
Systematic Root Cause Analysis and Documentation
Troubleshooting doesn’t end when the symptom disappears. For maintainable operations:
- Capture timeline
  - When did it start?
  - What changed right before? (upgrade, new operator, config change)
- Document evidence
  - Commands used, key outputs.
  - Screenshots or exported logs as needed.
  - Alerts that were firing and their resolution.
- Identify root cause
  - Misconfiguration?
  - Capacity limit?
  - Platform bug?
  - External system failure (storage, DNS, load balancer)?
- Define preventive actions
  - Hardening configuration (quotas, limits, health checks).
  - Adding or tuning alerts.
  - Automating checks in CI/CD or cluster policies.
- Share knowledge
  - Internal runbooks or knowledge base entries.
  - Post-incident review with your team.
Building this habit improves overall cluster reliability and shortens future troubleshooting sessions.
When and How to Escalate
Sometimes, issues require vendor support or a specialized SRE/operations team.
Before escalating:
- Run `oc adm must-gather` and save the output.
- Capture (into files, as sketched after this list):
  - `oc get clusteroperators`
  - `oc get nodes`
  - `oc get co -o yaml` and `oc get nodes -o wide` if practical.
  - Logs from affected namespaces (e.g. storage, ingress, etcd).
- Write a brief summary:
  - Symptoms
  - Scope (which namespaces/nodes/users)
  - Recent changes
  - Steps already tried and their results
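A sketch of capturing that evidence into files you can attach to a ticket; file names are arbitrary, and the ingress router deployment is just an example of an affected component:
# Snapshot key cluster state
oc get clusteroperators > clusteroperators.txt
oc get clusteroperators -o yaml > clusteroperators.yaml
oc get nodes -o wide > nodes.txt
# Logs from an affected namespace (example: default ingress router)
oc logs -n openshift-ingress deploy/router-default --tail=1000 > router-default.log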
This structured context helps others quickly understand the situation and significantly reduces time to resolution.
By combining these tools, patterns, and practices, you build a repeatable approach to diagnosing and resolving most OpenShift cluster issues encountered in day-to-day operations.