Key Principles of Application Troubleshooting on OpenShift
Troubleshooting applications on OpenShift builds on the platform’s monitoring, logging, and observability features, but focuses specifically on identifying, isolating, and resolving issues at the application level, not on cluster-wide health.
This chapter assumes you already know:
- How metrics, logs, and traces are collected in OpenShift.
- Where to view dashboards and logs in the console.
The focus here is how to use those tools and other primitives to debug misbehaving applications in a structured way.
A Structured Troubleshooting Workflow
A typical troubleshooting flow for an application running on OpenShift:
- Detect the symptom
- Error in UI, failing API call, degraded performance, pod crash, etc.
- Check high-level status
- Are the pods up? Is the route responding? Are there alerts firing?
- Narrow down the layer
- Is it a code bug, configuration issue, dependency problem, or platform issue?
- Deep dive with the right tool
- Logs, events, metrics, traces, live pod inspection, or local reproduction.
- Verify the fix
- Roll out a change (config, image, scaling) and confirm via metrics and logs.
The rest of the chapter is organized around these steps and the tools you can use.
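As a quick way to run the "check high-level status" step from a terminal, the sketch below strings together a few read-only `oc` commands. The project name `my-project` and the label `app=myapp` are placeholders; substitute your own.

```
# Switch to the project that hosts the application (names are placeholders).
oc project my-project

# Are the pods up, and how often have they restarted?
oc get pods -l app=myapp

# Is the route admitted and pointing at the right service?
oc get routes

# Any recent warnings (failed pulls, probe failures, scheduling issues)?
oc get events --sort-by=.lastTimestamp | tail -20
```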
Using the Web Console for First-Level Diagnosis
For many issues, the fastest path is the web console. Focus on application-level resources: projects, workloads, pods, and routes.
Checking Workload and Pod Health
From the console:
- Go to the Developer perspective → Topology:
- Look at your application icon (Deployment, DeploymentConfig, etc.).
- Check:
- Are all expected pods running?
- Are there warning / error badges?
- Click the workload, then:
- Resources tab: see the pods, their status (Running, CrashLoopBackOff, ImagePullBackOff, Pending, etc.).
- Events tab: cluster events related to this workload (failed pulls, scheduling issues, probe failures).
Common conditions and what they suggest (at an application level):
- `CrashLoopBackOff`: The container starts but exits repeatedly.
  - Typically caused by application startup failure, configuration errors, missing dependencies, or migration scripts failing.
- `Error` or `Completed` when you expect `Running`: Often batch jobs or sidecar containers finishing unexpectedly.
- `Pending` (for a long time): Often not an application bug per se; usually insufficient resources, node selectors, or scheduling constraints.
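The same states can be confirmed from the CLI; a minimal sketch, assuming the workload carries the label `app=myapp`:

```
# Phase, restart count, and the reason the last container instance exited.
oc get pods -l app=myapp \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount,LAST_REASON:.status.containerStatuses[0].lastState.terminated.reason
```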
Inspecting Route and Service Reachability
If the symptom is “my app is not reachable”:
- In Topology, select the application.
- Check Routes section:
- Confirm a route exists and has a host.
- Use the “Open URL” link.
- If it fails:
- Look at Events for any service/endpoint issues.
- Check Pods → ensure endpoints are “Ready” (readiness probe passing).
Application-specific interpretations:
- If route works intermittently:
- Some pods may be failing readiness or responding slowly.
- If route returns HTTP 5xx:
- Application error within the pod, often visible in logs.
- If route returns 404:
- Path/mapping or application routing configuration issue.
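To reproduce these route checks from a terminal, one approach is to resolve the route host and call it directly. The route name `myapp` and the `/health` path are assumptions; adjust to your application.

```
# Look up the externally visible hostname of the route.
HOST=$(oc get route myapp -o jsonpath='{.spec.host}')

# Call the application through the router and show only the status line and headers.
curl -ski "https://${HOST}/health" | head -5
```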
Using `oc` for Troubleshooting
The CLI gives you more detail and is better for systematic troubleshooting and automation.
Inspecting Pod and Container Status
Key commands:
- List pods in a project: `oc get pods`
- Describe a pod: `oc describe pod <pod-name>`
In `oc describe pod`, focus on:
- Conditions (`Ready`, `ContainersReady`).
- The Events section:
  - `Back-off restarting failed container` → application crashing during startup.
  - `Readiness probe failed` / `Liveness probe failed` → application not responding properly.
  - `FailedMount`, `FailedScheduling` → may reflect dependencies (e.g., storage, node placement) required by the app.
Use label selectors to narrow scope:
- `oc get pods -l app=myapp`
- `oc describe deploy myapp`
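A small helper pattern, again assuming the `app=myapp` label, is to capture the first matching pod name and feed it to `oc describe`:

```
# Grab the name of the first pod carrying the label (label is a placeholder).
POD=$(oc get pods -l app=myapp -o jsonpath='{.items[0].metadata.name}')

# Full status, conditions, and events for that pod.
oc describe pod "$POD"
```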
Accessing Pod Logs
Logs are often the most direct window into what is going wrong in the application.
Basic log access:
- `oc logs <pod-name>`
- `oc logs <pod-name> -c <container-name>` (for multi-container pods)
For real-time issues:
- `oc logs -f <pod-name>` to follow logs.
Application-oriented tips:
- If the pod crashes at startup:
  - Check logs without `-f` first; they may be short but contain a stack trace or configuration error.
- For CrashLoopBackOff:
  - Check `oc logs --previous <pod-name>` to see logs from the previous failed container instance.
- For multi-container pods:
  - Identify which container is the “application” vs sidecars (log collectors, proxies) and focus on that container.
If your logs are structured (JSON, key/value), keep that in mind when interpreting them; centralized logging will rely on consistent structure for search.
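If the application does emit JSON logs, a quick filter with `jq` can surface only error-level entries. The field names `level`, `ts`, and `msg` below are assumptions about your log schema.

```
# Show recent error-level entries from any pod of the deployment
# (deployment name and JSON field names are hypothetical).
oc logs deploy/myapp --since=15m \
  | jq -Rr 'fromjson? | select(.level == "error") | "\(.ts) \(.msg)"'
```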
Interactive Debugging in a Running Pod
Sometimes you need to inspect the runtime environment from inside the pod.
Using `oc rsh` and `oc exec`
- `oc rsh <pod-name>`: Open a remote shell into the default container.
- `oc exec -it <pod-name> -- /bin/bash`: Execute a specific shell (or command) interactively.
Common checks:
- Environment:
  - `env` to verify configuration variables.
- Files:
  - Check mounted config files (from ConfigMaps, Secrets) in expected paths.
- Network:
  - `curl http://service-name:port/path` to test connectivity to dependencies (databases, APIs) from the application pod’s viewpoint.
- Permissions:
  - List directories, attempt file writes in configured directories.
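These checks can also be run non-interactively with `oc exec`, which is convenient for copying output into a ticket. A sketch, assuming the container image includes a shell and `curl`, and using hypothetical names and paths:

```
# Configuration variables related to the database (filter term is a placeholder).
oc exec <pod-name> -- sh -c 'env | sort | grep -i db'

# Mounted ConfigMap/Secret files (path is a placeholder).
oc exec <pod-name> -- ls -l /etc/myapp

# Dependency reachability from the pod's network context (service name is hypothetical).
oc exec <pod-name> -- curl -sS -o /dev/null -w '%{http_code}\n' http://backend-api:8080/health
```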
If the container image does not include debug tools, you may need to use other patterns (see “Ephemeral debug containers” below).
Ephemeral Debug Containers
OpenShift supports starting a temporary “debug” container attached to an existing pod to inspect its environment without modifying the original image.
Example:
Example: `oc debug pod/<pod-name>`
This usually:
- Launches a debug copy of the pod (running a shell in place of the normal entrypoint, or an alternative debug image if you request one).
- Mounts the same volumes.
- Lets you inspect configuration, volumes, and sometimes even processes (depending on configuration).
Application use-cases:
- Validating that ConfigMaps/Secrets mounted into the original pod look correct.
- Inspecting directories where the app writes data/logs.
- Checking network reachability to dependencies from the pod’s network context.
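A minimal sketch of this workflow, with hypothetical pod and path names:

```
# Start a debug copy of the pod; by default you are dropped into a shell.
oc debug pod/<pod-name>

# Inside the debug shell: verify mounted configuration and writable paths (paths are placeholders).
ls -l /etc/myapp
cat /etc/myapp/application.properties
touch /var/lib/myapp/write-test && rm /var/lib/myapp/write-test
```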
Interpreting Common Application-Level Failure Modes
The same platform symptoms can be caused by many different application-level issues. The goal is to learn what to check next.
CrashLoopBackOff and Application Crashes
Symptom:
- `STATUS` shows `CrashLoopBackOff` in `oc get pods`.
Debug steps:
- Check recent logs:
  - `oc logs <pod-name> --previous`
- Look for:
  - Uncaught exceptions, stack traces.
  - “Cannot connect” errors to external services.
  - “File not found” for configuration or dependency files.
- Check configuration:
  - `oc set env deploy/<deployment-name> --list`
  - `oc rsh` or `oc exec` to verify config files exist.
- Correlate with recent changes:
  - New image version?
  - New config or secret?
  - New environment variable?
Typical root causes:
- Code bug introduced in new release.
- Application crashes when a required env var is missing.
- DB migrations failing at startup.
- Port conflicts (application listening on a different port than the one defined in the container or probe).
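One compact way to work through these steps, with placeholder names:

```
# Why did the last container instance exit? (Error, OOMKilled, ...)
oc get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

# Startup logs from that failed instance.
oc logs <pod-name> --previous

# Environment the new revision is actually running with.
oc set env deploy/<deployment-name> --list
```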
Readiness and Liveness Probe Failures
Symptoms:
- Pods stuck in `NotReady`.
- Frequent restarts due to liveness probe failing.
- Route returns intermittent 503 errors.
Debug steps:
- `oc describe pod <pod>`:
  - Look at events: `Readiness probe failed: ...` or `Liveness probe failed: ...`.
- From inside the pod (`oc rsh`):
  - Run the same endpoint checked by the probe: `curl http://localhost:<port>/health` (adjust path/port).
  - Observe response code and body.
- Verify that:
- The app actually listens on the port defined in the container spec.
- The health endpoint is correctly implemented and fast.
Typical root causes:
- Application takes longer to start than probe configuration allows (`initialDelaySeconds` or `timeoutSeconds` too small).
- Incorrect probe path or port.
- Health check logic too strict (failing probe because of minor internal warnings).
- Application-level resource contention causing slow responses.
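To compare what the probe expects with how the endpoint actually behaves, a short sketch (deployment name, port, and path are assumptions):

```
# Print the readiness probe currently defined on the deployment.
oc get deploy <name> -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'; echo

# Hit the same endpoint from inside the pod and time it (requires curl in the image).
oc exec <pod-name> -- curl -sS -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  http://localhost:8080/health
```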
Configuration and Secret Issues
Configuration problems often manifest as:
- Startup failures.
- Runtime errors when accessing external services.
- Subtle behavior differences between environments.
Debug steps:
- Inspect environment at the workload level:
  - `oc set env deploy/<name> --list`
- Inspect ConfigMaps and Secrets:
  - `oc get configmap` / `oc get secret`
  - `oc describe configmap <name>`
- Check mounts and environment from inside the pod:
  - `oc rsh <pod>`
- Confirm:
- Files are present at the expected paths.
- Values are what the app expects (strings vs JSON, paths, URL formats).
Typical root causes:
- Typo in environment variable names (application expects `DB_HOST`, you defined `DBHOST`).
- Secrets mounted at the wrong path, or exposed as env vars instead of files (or vice versa).
- Values formatted incorrectly (e.g., a value the application expects as a string is interpreted as a boolean or integer because it was not quoted in the YAML).
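A short sketch for confirming that names and values line up; the deployment `myapp`, Secret `myapp-db`, and key `DB_HOST` are hypothetical:

```
# Environment variables as the Deployment defines them.
oc set env deploy/myapp --list

# Decode one key from the Secret to check its exact value and format.
oc get secret myapp-db -o jsonpath='{.data.DB_HOST}' | base64 -d; echo
```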
Dependency and Network-Related Application Issues
Applications often depend on databases, message queues, or external APIs. Troubleshooting must distinguish “my app is broken” from “my app cannot reach its dependency.”
Debug steps:
- From the pod:
  - `oc rsh <pod>`
- Try to reach the service:
  - `curl http://service-name:port`
  - `nc -zv service-name port` (if tooling is available).
- Check service and endpoints:
  - `oc get svc`
  - `oc describe svc <service-name>`
  - `oc get endpoints <service-name>`
  - Ensure there are ready endpoints.
- Look at application logs:
- Connection timeout errors vs authentication errors vs DNS resolution errors.
Typical root causes:
- Service name/port mismatch in the application configuration.
- Network policies blocking traffic between pods/namespaces.
- Credentials misconfigured → authentication failures.
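To separate “my app cannot resolve or reach the dependency” from “the dependency has no healthy backends”, a sketch using a hypothetical `postgres` service:

```
# From inside the application pod: DNS resolution and TCP reachability
# (getent and nc are only available if the image ships them).
oc exec <pod-name> -- getent hosts postgres
oc exec <pod-name> -- nc -zv postgres 5432

# From outside: does the Service have any ready endpoints at all?
oc get endpoints postgres
```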
Using Metrics and Dashboards to Troubleshoot Performance and Reliability
For performance, latency, or resource issues, logs alone are insufficient. Use the built-in monitoring stack to identify where the bottleneck is.
Identifying Resource Constraints
From the console (Developer or Administrator perspective):
- Open Monitoring → Metrics / Dashboards.
- Inspect application workloads:
- CPU usage vs CPU requests/limits.
- Memory usage vs memory requests/limits.
- Pod restarts correlated with OOM (Out-Of-Memory) kills or throttling.
Application signs of resource issues:
- Frequent restarts with `OOMKilled` in pod status.
- High latency under load with CPU throttling.
- Garbage collector or runtime warnings in logs when memory is tight.
At the CLI:
- `oc describe pod <pod>`:
  - Check `Last State: Terminated` (`OOMKilled`) and resource requests/limits.
- Adjust the application’s resource requests/limits through the deployment.
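To put numbers next to those checks, a sketch assuming cluster metrics are available and the workload is labeled `app=myapp`:

```
# Live CPU/memory usage per pod (requires the cluster metrics pipeline).
oc adm top pods -l app=myapp

# Requests and limits as configured on the deployment.
oc get deploy myapp \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'
```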
Correlating Metrics with Application Events
Effective troubleshooting often relies on correlating:
- Deployments / releases.
- Traffic spikes.
- Error rates and response latency.
Use metrics to:
- Confirm whether a spike in 5xx responses coincides with:
- A new rollout.
- CPU usage reaching limit.
- Increased request rate.
- Distinguish between:
- Constantly high failure rate (config or code bug).
- Failures only under high load (capacity or scaling issue).
If tracing is used, you can additionally:
- Pinpoint which service or specific endpoint in a microservices chain is slow.
- Identify external calls (e.g., to a third-party API) that dominate latency.
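Rollouts leave timestamps behind in their ReplicaSets, which you can line up against metric graphs; a small sketch, assuming the `app=myapp` label:

```
# When did each revision appear? Compare these times with spikes on your dashboards.
oc get replicasets -l app=myapp \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp
```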
Debugging Rollouts and Version-Specific Issues
In OpenShift, application changes are typically introduced through Deployments or DeploymentConfigs. Many problems occur only after a new version is rolled out.
Comparing Old and New Revisions
Key steps:
- List rollouts:
  - For Deployments: `oc rollout history deploy/<name>`
- Inspect a specific revision:
  - `oc rollout history deploy/<name> --revision=<n>`
- Compare:
- Image version (tag, digest).
- Environment variables.
- ConfigMap/Secret versions referenced.
- Resource limits.
If the previous version worked:
- Try rolling back:
  - `oc rollout undo deploy/<name>`
- Confirm if the problem goes away:
- If yes, root cause likely in changed code/configuration of the new revision, not in the platform.
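A compact rollback sketch with a hypothetical deployment name:

```
# What changed between revisions? (revision number is only an example)
oc rollout history deploy/myapp
oc rollout history deploy/myapp --revision=3

# Roll back to the previous revision and wait for it to settle.
oc rollout undo deploy/myapp
oc rollout status deploy/myapp
```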
Canary and Blue-Green Considerations
In more advanced setups (multiple versions running concurrently):
- Ensure traffic is actually routed where you expect:
- Check route configuration and any A/B or canary routing rules.
- Use version-specific labels:
- Filter metrics and logs by label (version/build) to see which version is failing.
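If your versions carry a `version` label (an assumption; adjust to your labeling scheme), you can scope both pods and logs to a single version:

```
# Only the pods of the suspect version.
oc get pods -l app=myapp,version=v2

# Recent logs from all pods of that version.
oc logs -l app=myapp,version=v2 --tail=100
```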
Application-Centric Use of Centralized Logging
Centralized logging (discussed elsewhere) becomes more powerful with good application log practices. From a troubleshooting point of view, focus on:
- Filtering logs by:
- Namespace, app label, version, pod.
- Searching for:
- Error-level logs around the time of the problem.
- Correlation IDs or request IDs (if your application propagates them).
Example queries in a logging UI might include:
- All error logs for app `myapp` in the last 15 minutes.
- All logs containing a particular user ID, transaction ID, or request path.
Application-level tips:
- Emit meaningful error messages and contextual data at error boundaries.
- Avoid logging sensitive data; rely on correlation IDs for tracing problems end-to-end.
Practical Debugging Scenarios
To consolidate the concepts, here are representative patterns and targeted actions.
Scenario 1: Application Not Starting After New Deploy
Symptoms:
- New Deployment rolled out.
- Pods in CrashLoopBackOff.
Actions:
- `oc get pods` (identify new pods).
- `oc logs --previous <pod>` (capture startup logs).
- Compare environment and image between working and broken revisions:
  - `oc rollout history deploy/myapp`
  - `oc describe deploy myapp`
- If config-related:
- Fix ConfigMap/Secret or env vars.
- Redeploy and verify.
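Put together, the triage for this scenario might look like the following sketch (names are placeholders):

```
# New pods and their states after the rollout.
oc get pods -l app=myapp

# Startup logs from the last failed attempt.
oc logs <pod-name> --previous

# Compare with the previous revision; roll back if it was healthy.
oc rollout history deploy/myapp
oc rollout undo deploy/myapp
```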
Scenario 2: Application Suddenly Slow Under Load
Symptoms:
- Increased response times.
- Some requests timing out.
- No recent deployments.
Actions:
- Check metrics for:
- CPU and memory usage of pods.
- Request rate and error rate.
- Look for:
- CPU throttling (near or at limit).
- OOMKilled restarts.
- Evaluate scaling:
- Is Horizontal Pod Autoscaling in place and working?
- Use tracing (if available) or logs to see which operations are slow.
- Adjust:
- Resource requests/limits.
- Application code / queries.
- Autoscaling thresholds.
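A few commands that cover the scaling part of this checklist (names and thresholds are hypothetical):

```
# Is an autoscaler present, and is it actually scaling?
oc get hpa
oc describe hpa myapp

# Current usage next to requests/limits (requires cluster metrics).
oc adm top pods -l app=myapp

# If no autoscaler exists yet, a starting point; tune thresholds to your app.
oc autoscale deploy/myapp --min=2 --max=10 --cpu-percent=75
```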
Scenario 3: Some Requests Fail, Others Succeed
Symptoms:
- Intermittent 5xx or 503.
- Only some users affected.
Actions:
- Check pod readiness and restarts:
  - `oc get pods`
  - `oc describe pod <pod>`
- Examine logs for specific pods with errors around the same time.
- Check if:
- Only specific pods misbehave (e.g., one pod misconfigured, others fine).
- Errors correlate with particular routes/paths.
- Drain or delete problematic pod(s) and observe:
- If problem disappears, inspect configuration differences.
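One way to isolate a misbehaving pod, as a sketch with placeholder names:

```
# Which pod is producing the errors? Compare restart counts and node placement.
oc get pods -l app=myapp -o wide

# Remove the suspect pod; the Deployment replaces it automatically.
oc delete pod <pod-name>

# Watch the replacement come up and confirm whether the intermittent errors stop.
oc get pods -l app=myapp -w
```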
Building Troubleshooting into Your Development Process
The most effective troubleshooting is prepared:
- Design applications with:
- Clear and consistent logging (levels, structure, correlation IDs).
- Health endpoints suitable for probes and manual checks.
- Configurations externalized and observable (e.g., logging config at startup).
- Include:
- Synthetic tests and dashboards targeting key endpoints.
- Alerts on critical error rates and latencies.
In OpenShift, this means:
- Leveraging the platform’s logging and monitoring but making sure your application exposes enough meaningful signals.
- Using `oc` and the console not only when something breaks, but also during development and testing, so the troubleshooting process becomes familiar and repeatable.
By combining logs, metrics, traces, and the runtime inspection tools shown here, you can methodically track down most application-level issues on OpenShift.