
Cluster troubleshooting

Typical Troubleshooting Mindset and Workflow

Troubleshooting an OpenShift cluster is less about memorizing every command and more about having a systematic approach. A practical workflow:

  1. Define the problem clearly
    • What is broken? (app, route, pod scheduling, cluster API, etc.)
    • Since when? After which change (upgrade, config, deployment)?
    • Is it localized (single project/node) or widespread (cluster-wide)?
  2. Check blast radius
    • One user or all users?
    • One namespace or all namespaces?
    • One node or multiple nodes / zones?
  3. Confirm symptoms with basic health checks
    • Can you oc login?
    • Is the API responsive?
    • Are core namespaces (e.g. openshift-apiserver, openshift-operator-lifecycle-manager) healthy?
  4. Narrow down the layer
    • Application layer (pods, Deployments, Routes)
    • Namespace / quota / RBAC issues
    • Cluster services (DNS, image registry, SDN, Ingress)
    • Control plane (API server, controllers, etcd)
    • Infrastructure (nodes, network, storage)
  5. Form a hypothesis, test, adjust
    • Make only one change at a time.
    • Log what you did and why (for handover and future reference).
  6. Know when to escalate
    • Gather diagnostics (logs, events, sosreport, must-gather) before escalating to SRE or vendor.

The rest of this chapter focuses on practical tools and patterns to support this workflow.

Core Tools for Troubleshooting

Basic `oc` diagnostics

Common starting commands:

Examples:

# Overall cluster condition
oc status
# Check all cluster operators
oc get clusteroperators
# Describe one operator in detail
oc describe clusteroperator authentication
# Check pod state across all namespaces
oc get pods -A
# Verify can you reach the API smoothly
oc get nodes

Using `oc describe` effectively

oc describe is often more useful than oc get when troubleshooting, because it combines an object's spec, status, and recent events in a single view.

Example:

oc describe pod my-app-76cd49fdc6-8q4kt

Focus on the Events section at the bottom of the output, the container states and last termination reasons, and the configured resource requests, limits, volumes, and node assignment.
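
If you want the events on their own rather than buried in the describe output, something along these lines works (namespace and pod name are placeholders):

# Recent events in the namespace, newest last
oc get events -n my-namespace --sort-by=.lastTimestamp
# Events that reference one specific pod
oc get events -n my-namespace --field-selector involvedObject.name=my-app-76cd49fdc6-8q4kt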

`oc debug` for deep inspection

oc debug lets you start a copy of a failing pod with an interactive shell, run a privileged debug pod on a node and inspect the host filesystem under /host, and try out images or entrypoints without touching the original workload.

Examples:

# Debug a faulty pod with its filesystem mounted
oc debug pod/my-app-76cd49fdc6-8q4kt
# Start a debug pod on a specific node
oc debug node/ip-10-0-1-23.ec2.internal
# In the debug session, you can inspect host paths like /host/etc, /host/var

Use oc debug node/<node-name> to inspect node-level issues without modifying the node’s configuration or relying on direct SSH access.
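
A minimal sketch of typical read-only checks inside a node debug session, assuming an RHCOS node with CRI-O (the node name is a placeholder):

oc debug node/ip-10-0-1-23.ec2.internal
# Inside the debug pod, switch into the host filesystem
chroot /host
# Kubelet and container runtime health
systemctl status kubelet
journalctl -u kubelet --since "30 min ago" | tail -n 50
crictl ps
# Disk pressure is a frequent cause of node trouble
df -h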

Collecting Logs

Application pod logs

Basic usage:

# Last logs from the main container
oc logs my-app-76cd49fdc6-8q4kt
# Logs from a named container in a pod
oc logs my-app-76cd49fdc6-8q4kt -c sidecar
# Stream logs
oc logs -f my-app-76cd49fdc6-8q4kt
# Previous container instance (useful in CrashLoopBackOff)
oc logs my-app-76cd49fdc6-8q4kt -p
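
A few log variants that often save time; the Deployment name and the app=my-app label are assumptions for illustration:

# Logs from one pod of a Deployment (oc picks a pod for you)
oc logs deployment/my-app --all-containers
# Logs from every pod matching a label, last 50 lines each
oc logs -l app=my-app --tail=50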

OpenShift component logs

Most OpenShift platform components run as pods in system namespaces.

Examples:

# List pods in a system namespace
oc get pods -n openshift-apiserver
# Logs for a specific OpenShift component pod
oc logs -n openshift-apiserver apiserver-6f488978d6-f7nrn

Common namespaces to inspect include openshift-apiserver, openshift-kube-apiserver, openshift-etcd, openshift-ingress, openshift-dns, openshift-image-registry, and openshift-monitoring.

Focus on Warning or Error lines and repeated failures.
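To surface platform-level problems without opening every log, filtering events and grepping a component log can help (the pod name is the same placeholder used above):

# Warning events across all namespaces, newest last
oc get events -A --field-selector type=Warning --sort-by=.lastTimestamp
# Recent error lines from one component pod
oc logs -n openshift-apiserver apiserver-6f488978d6-f7nrn | grep -iE "error|fail" | tail -n 50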

Using `oc adm` and `must-gather`

For more systematic cluster diagnostics, especially before escalation or when there are widespread issues, OpenShift provides admin-level tools.

`oc adm top` and quick resource checks

Examples:

oc adm top nodes
oc adm top pods -A

Useful to identify nodes running close to their CPU or memory capacity, pods consuming unexpectedly high resources, and imbalanced scheduling across nodes.
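
To relate live usage to what workloads actually requested, a sketch along these lines can help (namespace and node name are placeholders):

# Per-container usage in one namespace
oc adm top pods -n my-namespace --containers
# Requested vs. allocatable resources on one node
oc describe node ip-10-0-1-23.ec2.internal | grep -A 10 "Allocated resources"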

`oc adm must-gather`

must-gather collects a comprehensive snapshot of cluster state and logs.

Basic usage with default image:

oc adm must-gather

This will run a temporary collection pod with wide read access, gather logs, resource definitions, and operator status from the core OpenShift namespaces, and write the result to a local must-gather.local.* directory.

You can specify a custom image (for vendor support or specialized collections) with --image. The output is a directory you can compress and share with support or keep for offline analysis.
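
A hedged example with an explicit output directory and an optional custom collection image (registry path and directory names are placeholders):

# Write the collection to a known directory
oc adm must-gather --dest-dir=./must-gather-incident-42
# Use a vendor- or component-specific collection image
oc adm must-gather --image=registry.example.com/support/must-gather:latest --dest-dir=./must-gather-custom
# Compress before sharing with support
tar czf must-gather-incident-42.tar.gz must-gather-incident-42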

Best practices: run it as soon as possible after an incident (before logs rotate), record the time window and observed symptoms alongside the archive, and share the output unmodified with support.

Identifying and Working with Common Failure Patterns

Pods stuck in Pending

Typical root causes: insufficient CPU or memory on schedulable nodes, node selectors or affinity rules that no node satisfies, taints without matching tolerations, unbound PersistentVolumeClaims, or an exhausted namespace quota.

Diagnostics:

oc get pods
oc describe pod <pod-name>

Check the Events section in oc describe, in particular FailedScheduling messages, which state why no node was suitable (insufficient resources, taints, affinity, or volume constraints).

Remediation examples: reduce resource requests or add node capacity, fix node selectors and tolerations, provision or bind the missing PVC, or raise or free the namespace quota.
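
A few commands that often pinpoint the scheduling blocker, run in the affected project:

# Scheduling failures, newest last
oc get events --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
# Is a required PVC still Pending?
oc get pvc
# Is the namespace quota exhausted? (quota is a shortname for resourcequota)
oc describe quota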

Pods in CrashLoopBackOff

Common causes: the application exits at startup due to bad configuration or missing environment variables/Secrets, failing liveness probes, file or volume permission problems, and containers killed for exceeding their memory limit (OOMKilled).

Diagnostics:

  1. Check pod status and recent restarts:
   oc get pod <pod-name>
   oc describe pod <pod-name>
  2. Look at the previous container's logs:
   oc logs <pod-name> -p
  3. If needed, override the container’s entrypoint or start a debuggable copy of the Deployment (for example with oc debug deployment/<name>) to get an interactive shell.

Remediation: fix the configuration or supply the missing Secret/ConfigMap, tune probe thresholds or startup timing, and raise memory limits if the container is being OOMKilled.
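
A sketch for finding the crash reason and getting a shell, assuming a Deployment named my-app and the pod name used earlier:

# Why did the previous container instance exit? (OOMKilled, Error, ...)
oc get pod my-app-76cd49fdc6-8q4kt -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
# Start a copy of the Deployment with an interactive shell instead of the normal entrypoint
oc debug deployment/my-app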

Image pull failures

Symptoms: pods stuck in ErrImagePull or ImagePullBackOff, and Deployments that never reach their desired replica count.

Diagnostics with:

oc describe pod <pod-name>

Look for events such as "Failed to pull image", "manifest unknown", or "unauthorized: authentication required".

Typical causes: a wrong image name or tag, a missing or invalid pull secret, a registry that is unreachable from the nodes, or rate limiting on public registries.

Checks:

  oc get secrets
  oc describe secret <pull-secret-name>
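
If the cause is a missing pull secret, a typical fix looks roughly like this; registry, secret name, and credentials are placeholders:

# Create a pull secret for a private registry
oc create secret docker-registry my-pull-secret --docker-server=registry.example.com --docker-username=myuser --docker-password=REDACTED
# Let the default service account use it for image pulls
oc secrets link default my-pull-secret --for=pull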

Networking and connectivity issues

High-level networking problems can manifest as Services that resolve but do not respond, Routes returning 503 errors, DNS failures inside pods, or intermittent timeouts between pods, namespaces, or nodes.

Checking Service and endpoint health

For a given Service:

oc get svc my-service
oc describe svc my-service
oc get endpoints my-service

Confirm that the Service selector matches the pod labels, that targetPort matches the container port, and that the endpoints list contains the IPs of Ready pods.

If endpoints are empty, the selector matches no pods, or the matching pods are not Ready (for example because their readiness probes are failing).
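
To compare the Service selector against actual pod labels, assuming the label app=my-app purely for illustration:

# Which labels does the Service select on?
oc get svc my-service -o jsonpath='{.spec.selector}{"\n"}'
# Do any Running, Ready pods carry those labels?
oc get pods -l app=my-app -o wide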

Testing connectivity from inside the cluster

Use a temporary debug pod:

oc run -it net-debug --image=registry.access.redhat.com/ubi9/ubi-minimal -- sh

Then test DNS resolution of the Service name and TCP/HTTP connectivity to the Service port (for example with curl).

If DNS fails, inspect DNS-related components (e.g. openshift-dns pods).
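
A sketch for checking cluster DNS health and name resolution; service name, namespace, and port are placeholders, and you may need to install curl with microdnf if the image lacks it:

# Are the DNS operator and the cluster DNS pods healthy?
oc get clusteroperator dns
oc get pods -n openshift-dns
# From the net-debug pod, resolve and reach a Service by its cluster DNS name
curl -sv http://my-service.my-namespace.svc.cluster.local:8080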

Route and Ingress issues

For a Route:

oc get route my-route
oc describe route my-route

Focus on the host name, the target Service and port, the TLS termination type, and the status conditions showing whether the Route was admitted by an Ingress Controller.

If the Route is not admitted, check Ingress Controller pods in openshift-ingress namespace.

From outside the cluster, verify that DNS resolves the Route host to the router’s load balancer, that the expected certificate is served, and which HTTP status code is returned.

Use curl -v from a client to see HTTP status codes and TLS errors.
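
A sketch of typical Route checks, assuming the default Ingress Controller deployment name router-default and a placeholder Route host:

# Ingress Controller (router) pods and recent logs
oc get pods -n openshift-ingress
oc logs -n openshift-ingress deployment/router-default | tail -n 50
# From outside the cluster: DNS, TLS certificate, and HTTP status in one view
curl -kv https://my-route-host.apps.example.com/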

Storage-related problems

Common observed states:

Diagnosing PVC and PV issues

Check PVC:

oc get pvc
oc describe pvc <pvc-name>

Look for the PVC phase (Pending vs. Bound), the requested storage class, size, and access mode, and events describing provisioning failures.

Check corresponding PV (if statically provisioned or already created):

oc get pv
oc describe pv <pv-name>

Possible issues: a missing or misconfigured StorageClass or provisioner, a capacity or access-mode mismatch between PVC and PV, or a PV left in Released state that was never reclaimed.

If pods are failing with mount errors, inspect node logs via oc debug node and check the kubelet and CSI driver messages, and whether the storage backend is reachable from that node.
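
To see the kubelet’s view of the mount failure, a node debug session like the following can help (node name is a placeholder):

oc debug node/ip-10-0-1-23.ec2.internal
chroot /host
journalctl -u kubelet --since "1 hour ago" | grep -iE "mount|attach" | tail -n 50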

Control Plane and Cluster Operator Troubleshooting

Using ClusterOperators as health indicators

Cluster Operators express the health and upgrade status of major OpenShift components.

Basic inspection:

oc get clusteroperators

Columns to interpret: AVAILABLE, PROGRESSING, and DEGRADED, together with the reported VERSION and how long the current state has persisted (SINCE).

Example pattern: an operator that reports AVAILABLE=False or DEGRADED=True for a sustained period is where to start; PROGRESSING=True is normal during upgrades but should not persist indefinitely.

Deep-dive into a specific operator:

oc describe clusteroperator ingress

Look at the Conditions and their messages, which usually explain why the operator is degraded, and at the Related Objects, which point to the namespaces and resources to inspect next.
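
To print an operator’s conditions with their messages in one go (ingress is just an example):

oc get clusteroperator ingress -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.message}{"\n"}{end}'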

Remediation often involves fixing the underlying resource the operator complains about (a certificate, a misconfiguration, broken pods in its namespace) and letting the operator reconcile, rather than modifying the operator itself.

API server and etcd issues

Symptoms: slow or failing oc commands, API timeouts in the web console, and degraded kube-apiserver or etcd ClusterOperators.

Investigations:

  oc get pods -n openshift-kube-apiserver
  oc logs -n openshift-kube-apiserver <pod-name>
  oc get pods -n openshift-etcd
  oc logs -n openshift-etcd <pod-name>

Watch for repeated leader elections, etcd warnings about slow requests and disk latency ("apply request took too long"), certificate errors, and frequent restarts of the API server pods.

These issues are sensitive; for major problems, collect must-gather and follow vendor-specific runbooks rather than making manual changes to etcd.

Node-Level Troubleshooting

Problems at the node level can impact scheduling, pod performance, and overall availability.

Detecting node issues

Quick overview:

oc get nodes

Node Status may show Ready, NotReady, or SchedulingDisabled (cordoned nodes are displayed as Ready,SchedulingDisabled).

For a specific node:

oc describe node <node-name>

Check the node Conditions (Ready, MemoryPressure, DiskPressure, PIDPressure), allocatable versus allocated resources, taints, and recent events.

If a node is NotReady, check whether the kubelet is running and can reach the API server, whether the node has network connectivity, and whether disk or memory pressure is being reported.

Inspecting a node with `oc debug`

Instead of direct SSH, use:

oc debug node/<node-name>

Inside the debug session, chroot /host to use the host’s tools, then inspect the kubelet and container runtime (systemctl status kubelet, journalctl -u kubelet, crictl ps), disk usage, and network configuration.

Use oc debug to avoid manual host modifications and to stay within supported platform practices.

Draining and cordoning nodes

When a node is unhealthy or needs maintenance, coordinate pod movement:

  oc adm cordon <node-name>
  oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
  oc adm uncordon <node-name>

Draining is covered more thoroughly in node maintenance; from a troubleshooting perspective, use it to isolate problematic nodes and verify whether issues follow a specific node or not.
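
To check whether workloads or symptoms follow the node, for example (node name is a placeholder):

# Which pods are (still) scheduled on the node?
oc get pods -A -o wide --field-selector spec.nodeName=ip-10-0-1-23.ec2.internal
# Confirm the node shows SchedulingDisabled after cordoning
oc get node ip-10-0-1-23.ec2.internal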

Troubleshooting the OpenShift Web Console

Typical symptoms: the console does not load, login fails or loops, or pages load but show errors when calling the API.

Key components: the console pods and Route in openshift-console, the console operator in openshift-console-operator, and the authentication (OAuth) stack the console depends on.

Checks:

oc get pods -n openshift-console
oc logs -n openshift-console <console-pod>
oc get routes -n openshift-console
oc describe route console -n openshift-console

Verify that the console pods are Running, that the console Route is admitted and reachable externally, and that its TLS certificate is valid.

If the console cannot reach the API (but oc works), look for proxy or network policy restrictions between the console pods and the API, certificate problems, and OAuth/authentication errors in the console and console operator logs.
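
If the console pods themselves look healthy, the console operator is the next place to check; the names below assume a default installation:

oc get clusteroperator console
oc logs -n openshift-console-operator deployment/console-operator | tail -n 50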

Using Monitoring and Logging for Troubleshooting

The monitoring and logging stack is covered elsewhere in detail; here the focus is on how to use it during incident response.

Quick checks with built-in monitoring

Within the Admin console (if accessible), check Observe → Alerting for firing alerts and Observe → Dashboards or Metrics for resource usage of the affected nodes and workloads.

From the CLI, you can check the monitoring stack itself (pods in openshift-monitoring) and query metrics or firing alerts over the exposed monitoring Routes.
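
One way to query firing alerts from the CLI is via the Thanos Querier Route with your user token; a sketch, assuming your account may view metrics and the default route name:

TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
# All currently firing alerts, via the ALERTS metric
curl -skG -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" --data-urlencode 'query=ALERTS{alertstate="firing"}'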

Logging considerations

Depending on how cluster logging is configured (e.g. via the OpenShift Logging Operator), you might have a central log store (Loki- or Elasticsearch-based) with a console plugin or Kibana, and/or log forwarding to an external system.

During troubleshooting, filter the central logs by namespace, pod, and container, and correlate log timestamps with events, alerts, and the change timeline.

If centralized logging is not available, rely on oc logs and node-level inspection via oc debug.

Systematic Root Cause Analysis and Documentation

Troubleshooting doesn’t end when the symptom disappears. For maintainable operations:

  1. Capture timeline
    • When did it start?
    • What changed right before? (upgrade, new operator, config change)
  2. Document evidence
    • Commands used, key outputs.
    • Screenshots or exported logs as needed.
    • Alerts that were firing and their resolution.
  3. Identify root cause
    • Misconfiguration?
    • Capacity limit?
    • Platform bug?
    • External system failure (storage, DNS, load balancer)?
  4. Define preventive actions
    • Hardening configuration (quotas, limits, health checks).
    • Adding or tuning alerts.
    • Automating checks in CI/CD or cluster policies.
  5. Share knowledge
    • Internal runbooks or knowledge base entries.
    • Post-incident review with your team.

Building this habit improves overall cluster reliability and shortens future troubleshooting sessions.

When and How to Escalate

Sometimes, issues require vendor support or a specialized SRE/operations team.

Before escalating, collect a fresh must-gather, record cluster and operator versions, summarize the timeline, blast radius, and symptoms, and note what you have already tried and the current state.

This structured context helps others quickly understand the situation and significantly reduces time to resolution.

By combining these tools, patterns, and practices, you build a repeatable approach to diagnosing and resolving most OpenShift cluster issues encountered in day-to-day operations.
