Key Principles of Application Troubleshooting on OpenShift
Troubleshooting applications on OpenShift builds on the platform’s monitoring, logging, and observability features, but focuses specifically on identifying, isolating, and resolving issues at the application level, not on cluster-wide health.
This chapter assumes you already know:
- How metrics, logs, and traces are collected in OpenShift.
- Where to view dashboards and logs in the console.
The focus here is how to use those tools and other primitives to debug misbehaving applications in a structured way.
A Structured Troubleshooting Workflow
A typical troubleshooting flow for an application running on OpenShift:
- Detect the symptom
- Error in UI, failing API call, degraded performance, pod crash, etc.
- Check high-level status
- Are the pods up? Is the route responding? Are there alerts firing?
- Narrow down the layer
- Is it a code bug, configuration issue, dependency problem, or platform issue?
- Deep dive with the right tool
- Logs, events, metrics, traces, live pod inspection, or local reproduction.
- Verify the fix
- Roll out a change (config, image, scaling) and confirm via metrics and logs.
The rest of the chapter is organized around these steps and the tools you can use.
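As a quick way to run the "check high-level status" step from a terminal, the sketch below strings together a few read-only `oc` commands. The project name `my-project` and the label `app=myapp` are placeholders; substitute your own.

```
# Switch to the project that hosts the application (names are placeholders).
oc project my-project

# Are the pods up, and how often have they restarted?
oc get pods -l app=myapp

# Is the route admitted and pointing at the right service?
oc get routes

# Any recent warnings (failed pulls, probe failures, scheduling issues)?
oc get events --sort-by=.lastTimestamp | tail -20
```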
Using the Web Console for First-Level Diagnosis
For many issues, the fastest path is the web console. Focus on application-level resources: projects, workloads, pods, and routes.
Checking Workload and Pod Health
From the console:
- Go to the Developer perspective → Topology:
- Look at your application icon (Deployment, DeploymentConfig, etc.).
- Check:
- Are all expected pods running?
- Are there warning / error badges?
- Click the workload, then:
- Resources tab: see the pods, their status (Running, CrashLoopBackOff, ImagePullBackOff, Pending, etc.).
- Events tab: cluster events related to this workload (failed pulls, scheduling issues, probe failures).
Common conditions and what they suggest (at an application level):
- `CrashLoopBackOff`: The container starts but exits repeatedly.
  - Typically caused by application startup failure, configuration errors, missing dependencies, or migration scripts failing.
- `Error` or `Completed` when you expect `Running`: Often batch jobs or sidecar containers finishing unexpectedly.
- `Pending` (for a long time): Often not an application bug per se; usually insufficient resources, node selectors, or scheduling constraints.
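The same states can be confirmed from the CLI; a minimal sketch, assuming the workload carries the label `app=myapp`:

```
# Phase, restart count, and the reason the last container instance exited.
oc get pods -l app=myapp \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount,LAST_REASON:.status.containerStatuses[0].lastState.terminated.reason
```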
Inspecting Route and Service Reachability
If the symptom is “my app is not reachable”:
- In Topology, select the application.
- Check Routes section:
- Confirm a route exists and has a host.
- Use the “Open URL” link.
- If it fails:
- Look at Events for any service/endpoint issues.
- Check Pods → ensure endpoints are “Ready” (readiness probe passing).
Application-specific interpretations:
- If route works intermittently:
- Some pods may be failing readiness or responding slowly.
- If route returns HTTP 5xx:
- Application error within the pod, often visible in logs.
- If route returns 404:
- Path/mapping or application routing configuration issue.
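To reproduce these route checks from a terminal, one approach is to resolve the route host and call it directly. The route name `myapp` and the `/health` path are assumptions; adjust to your application.

```
# Look up the externally visible hostname of the route.
HOST=$(oc get route myapp -o jsonpath='{.spec.host}')

# Call the application through the router and show only the status line and headers.
curl -ski "https://${HOST}/health" | head -5
```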
Using `oc` for Troubleshooting
The CLI gives you more detail and is better for systematic troubleshooting and automation.
Inspecting Pod and Container Status
Key commands:
- List pods in a project: `oc get pods`
- Describe a pod: `oc describe pod <pod-name>`
In `oc describe pod`, focus on:
- Conditions (`Ready`, `ContainersReady`).
- The Events section:
  - `Back-off restarting failed container` → application crashing during startup.
  - `Readiness probe failed` / `Liveness probe failed` → application not responding properly.
  - `FailedMount`, `FailedScheduling` → may reflect dependencies (e.g., storage, node placement) required by the app.
Use label selectors to narrow scope:
- `oc get pods -l app=myapp`
- `oc describe deploy myapp`
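A small helper pattern, again assuming the `app=myapp` label, is to capture the first matching pod name and feed it to `oc describe`:

```
# Grab the name of the first pod carrying the label (label is a placeholder).
POD=$(oc get pods -l app=myapp -o jsonpath='{.items[0].metadata.name}')

# Full status, conditions, and events for that pod.
oc describe pod "$POD"
```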
Accessing Pod Logs
Logs are often the most direct window into what is going wrong in the application.
Basic log access:
- `oc logs <pod-name>`
- `oc logs <pod-name> -c <container-name>` (for multi-container pods)
For real-time issues:
- `oc logs -f <pod-name>` to follow logs.
Application-oriented tips:
- If the pod crashes at startup:
  - Check logs without `-f` first; they may be short but contain a stack trace or configuration error.
- For CrashLoopBackOff:
  - Check `oc logs --previous <pod-name>` to see logs from the previous failed container instance.
- For multi-container pods:
  - Identify which container is the “application” vs sidecars (log collectors, proxies) and focus on that container.
If your logs are structured (JSON, key/value), keep that in mind when interpreting them; centralized logging will rely on consistent structure for search.
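If the application does emit JSON logs, a quick filter with `jq` can surface only error-level entries. The field names `level`, `ts`, and `msg` below are assumptions about your log schema.

```
# Show recent error-level entries from any pod of the deployment
# (deployment name and JSON field names are hypothetical).
oc logs deploy/myapp --since=15m \
  | jq -Rr 'fromjson? | select(.level == "error") | "\(.ts) \(.msg)"'
```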
Interactive Debugging in a Running Pod
Sometimes you need to inspect the runtime environment from inside the pod.
Using `oc rsh` and `oc exec`
- `oc rsh <pod-name>`: Open a remote shell into the default container.
- `oc exec -it <pod-name> -- /bin/bash`: Execute a specific shell (or command) interactively.
Common checks:
- Environment:
  - `env` to verify configuration variables.
- Files:
  - Check mounted config files (from ConfigMaps, Secrets) in expected paths.
- Network:
  - `curl http://service-name:port/path` to test connectivity to dependencies (databases, APIs) from the application pod’s viewpoint.
- Permissions:
  - List directories, attempt file writes in configured directories.
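These checks can also be run non-interactively with `oc exec`, which is convenient for copying output into a ticket. A sketch, assuming the container image includes a shell and `curl`, and using hypothetical names and paths:

```
# Configuration variables related to the database (filter term is a placeholder).
oc exec <pod-name> -- sh -c 'env | sort | grep -i db'

# Mounted ConfigMap/Secret files (path is a placeholder).
oc exec <pod-name> -- ls -l /etc/myapp

# Dependency reachability from the pod's network context (service name is hypothetical).
oc exec <pod-name> -- curl -sS -o /dev/null -w '%{http_code}\n' http://backend-api:8080/health
```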
If the container image does not include debug tools, you may need to use other patterns (see “Ephemeral debug containers” below).
Ephemeral Debug Containers
OpenShift supports starting a temporary “debug” container attached to an existing pod to inspect its environment without modifying the original image.
Example:
Example: `oc debug pod/<pod-name>`
This usually:
- Launches a debug copy of the pod (running a shell in place of the normal entrypoint, or an alternative debug image if you request one).
- Mounts the same volumes.
- Lets you inspect configuration, volumes, and sometimes even processes (depending on configuration).
Application use-cases:
- Validating that ConfigMaps/Secrets mounted into the original pod look correct.
- Inspecting directories where the app writes data/logs.
- Checking network reachability to dependencies from the pod’s network context.
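A minimal sketch of this workflow, with hypothetical pod and path names:

```
# Start a debug copy of the pod; by default you are dropped into a shell.
oc debug pod/<pod-name>

# Inside the debug shell: verify mounted configuration and writable paths (paths are placeholders).
ls -l /etc/myapp
cat /etc/myapp/application.properties
touch /var/lib/myapp/write-test && rm /var/lib/myapp/write-test
```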
Interpreting Common Application-Level Failure Modes
The same platform symptoms can be caused by many different application-level issues. The goal is to learn what to check next.
CrashLoopBackOff and Application Crashes
Symptom:
- `STATUS` shows `CrashLoopBackOff` in `oc get pods`.
Debug steps:
- Check recent logs:
  - `oc logs <pod-name> --previous`
- Look for:
  - Uncaught exceptions, stack traces.
  - “Cannot connect” errors to external services.
  - “File not found” for configuration or dependency files.
- Check configuration:
  - `oc set env deploy/<deployment-name> --list`
  - `oc rsh` or `oc exec` to verify config files exist.
- Correlate with recent changes:
  - New image version?
  - New config or secret?
  - New environment variable?
Typical root causes:
- Code bug introduced in new release.
- Application crashes when a required env var is missing.
- DB migrations failing at startup.
- Port conflicts (application listening on a different port than the one defined in the container or probe).
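One compact way to work through these steps, with placeholder names:

```
# Why did the last container instance exit? (Error, OOMKilled, ...)
oc get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

# Startup logs from that failed instance.
oc logs <pod-name> --previous

# Environment the new revision is actually running with.
oc set env deploy/<deployment-name> --list
```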
Readiness and Liveness Probe Failures
Symptoms:
- Pods stuck in `NotReady`.
- Frequent restarts due to liveness probe failing.
- Route returns intermittent 503 errors.
Debug steps:
- `oc describe pod <pod>`:
  - Look at events: `Readiness probe failed: ...` or `Liveness probe failed: ...`.
- From inside the pod (`oc rsh`):
  - Run the same endpoint checked by the probe: `curl http://localhost:<port>/health` (adjust path/port).
  - Observe response code and body.
- Verify that:
- The app actually listens on the port defined in the container spec.
- The health endpoint is correctly implemented and fast.
Typical root causes:
- Application takes longer to start than probe configuration allows (`initialDelaySeconds` or `timeoutSeconds` too small).
- Incorrect probe path or port.
- Health check logic too strict (failing probe because of minor internal warnings).
- Application-level resource contention causing slow responses.
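To compare what the probe expects with how the endpoint actually behaves, a short sketch (deployment name, port, and path are assumptions):

```
# Print the readiness probe currently defined on the deployment.
oc get deploy <name> -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'; echo

# Hit the same endpoint from inside the pod and time it (requires curl in the image).
oc exec <pod-name> -- curl -sS -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  http://localhost:8080/health
```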
Configuration and Secret Issues
Configuration problems often manifest as:
- Startup failures.
- Runtime errors when accessing external services.
- Subtle behavior differences between environments.
Debug steps:
- Inspect environment at the workload level:
  - `oc set env deploy/<name> --list`
- Inspect ConfigMaps and Secrets:
  - `oc get configmap` / `oc get secret`
  - `oc describe configmap <name>`
- Check mounts and environment from inside the pod:
  - `oc rsh <pod>`
- Confirm:
- Files are present at the expected paths.
- Values are what the app expects (strings vs JSON, paths, URL formats).
Typical root causes:
- Typo in environment variable names (application expects `DB_HOST`, you defined `DBHOST`).
- Secrets mounted at the wrong path, or exposed as env vars instead of files (or vice versa).
- Values formatted incorrectly (e.g., a value the application expects as a string is interpreted as a boolean or integer because it was not quoted in the YAML).
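A short sketch for confirming that names and values line up; the deployment `myapp`, Secret `myapp-db`, and key `DB_HOST` are hypothetical:

```
# Environment variables as the Deployment defines them.
oc set env deploy/myapp --list

# Decode one key from the Secret to check its exact value and format.
oc get secret myapp-db -o jsonpath='{.data.DB_HOST}' | base64 -d; echo
```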
Dependency and Network-Related Application Issues
Applications often depend on databases, message queues, or external APIs. Troubleshooting must distinguish “my app is broken” from “my app cannot reach its dependency.”
Debug steps:
- From the pod:
  - `oc rsh <pod>`
- Try to reach the service:
  - `curl http://service-name:port`
  - `nc -zv service-name port` (if tooling is available).
- Check service and endpoints:
  - `oc get svc`
  - `oc describe svc <service-name>`
  - `oc get endpoints <service-name>`
  - Ensure there are ready endpoints.
- Look at application logs:
- Connection timeout errors vs authentication errors vs DNS resolution errors.
Typical root causes:
- Service name/port mismatch in the application configuration.
- Network policies blocking traffic between pods/namespaces.
- Credentials misconfigured → authentication failures.
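To separate “my app cannot resolve or reach the dependency” from “the dependency has no healthy backends”, a sketch using a hypothetical `postgres` service:

```
# From inside the application pod: DNS resolution and TCP reachability
# (getent and nc are only available if the image ships them).
oc exec <pod-name> -- getent hosts postgres
oc exec <pod-name> -- nc -zv postgres 5432

# From outside: does the Service have any ready endpoints at all?
oc get endpoints postgres
```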
Using Metrics and Dashboards to Troubleshoot Performance and Reliability
For performance, latency, or resource issues, logs alone are insufficient. Use the built-in monitoring stack to identify where the bottleneck is.
Identifying Resource Constraints
From the console (Developer or Administrator perspective):
- Open Monitoring → Metrics / Dashboards.
- Inspect application workloads:
- CPU usage vs CPU requests/limits.
- Memory usage vs memory requests/limits.
- Pod restarts correlated with OOM (Out-Of-Memory) kills or throttling.
Application signs of resource issues:
- Frequent restarts with `OOMKilled` in pod status.
- High latency under load with CPU throttling.
- Garbage collector or runtime warnings in logs when memory is tight.
At the CLI:
- `oc describe pod <pod>`:
  - Check `Last State: Terminated` (`OOMKilled`) and resource requests/limits.
- Adjust the application’s resource requests/limits through the deployment.
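To put numbers next to those checks, a sketch assuming cluster metrics are available and the workload is labeled `app=myapp`:

```
# Live CPU/memory usage per pod (requires the cluster metrics pipeline).
oc adm top pods -l app=myapp

# Requests and limits as configured on the deployment.
oc get deploy myapp \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'
```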
Correlating Metrics with Application Events
Effective troubleshooting often relies on correlating:
- Deployments / releases.
- Traffic spikes.
- Error rates and response latency.
Use metrics to:
- Confirm whether a spike in 5xx responses coincides with:
- A new rollout.
- CPU usage reaching limit.
- Increased request rate.
- Distinguish between:
- Constantly high failure rate (config or code bug).
- Failures only under high load (capacity or scaling issue).
If tracing is used, you can additionally:
- Pinpoint which service or specific endpoint in a microservices chain is slow.
- Identify external calls (e.g., to a third-party API) that dominate latency.
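Rollouts leave timestamps behind in their ReplicaSets, which you can line up against metric graphs; a small sketch, assuming the `app=myapp` label:

```
# When did each revision appear? Compare these times with spikes on your dashboards.
oc get replicasets -l app=myapp \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp
```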
Debugging Rollouts and Version-Specific Issues
In OpenShift, application changes are typically introduced through Deployments or DeploymentConfigs. Many problems occur only after a new version is rolled out.
Comparing Old and New Revisions
Key steps:
- List rollouts:
  - For Deployments: `oc rollout history deploy/<name>`
- Inspect a specific revision:
  - `oc rollout history deploy/<name> --revision=<n>`
- Compare:
- Image version (tag, digest).
- Environment variables.
- ConfigMap/Secret versions referenced.
- Resource limits.
If the previous version worked:
- Try rolling back:
  - `oc rollout undo deploy/<name>`
- Confirm if the problem goes away:
- If yes, root cause likely in changed code/configuration of the new revision, not in the platform.
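A compact rollback sketch with a hypothetical deployment name:

```
# What changed between revisions? (revision number is only an example)
oc rollout history deploy/myapp
oc rollout history deploy/myapp --revision=3

# Roll back to the previous revision and wait for it to settle.
oc rollout undo deploy/myapp
oc rollout status deploy/myapp
```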
Canary and Blue-Green Considerations
In more advanced setups (multiple versions running concurrently):
- Ensure traffic is actually routed where you expect:
- Check route configuration and any A/B or canary routing rules.
- Use version-specific labels:
- Filter metrics and logs by label (version/build) to see which version is failing.
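If your versions carry a `version` label (an assumption; adjust to your labeling scheme), you can scope both pods and logs to a single version:

```
# Only the pods of the suspect version.
oc get pods -l app=myapp,version=v2

# Recent logs from all pods of that version.
oc logs -l app=myapp,version=v2 --tail=100
```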
Application-Centric Use of Centralized Logging
Centralized logging (discussed elsewhere) becomes more powerful with good application log practices. From a troubleshooting point of view, focus on:
- Filtering logs by:
- Namespace, app label, version, pod.
- Searching for:
- Error-level logs around the time of the problem.
- Correlation IDs or request IDs (if your application propagates them).
Example queries in a logging UI might include:
- All error logs for app `myapp` in the last 15 minutes.
- All logs containing a particular user ID, transaction ID, or request path.
Application-level tips:
- Emit meaningful error messages and contextual data at error boundaries.
- Avoid logging sensitive data; rely on correlation IDs for tracing problems end-to-end.
Practical Debugging Scenarios
To consolidate the concepts, here are representative patterns and targeted actions.
Scenario 1: Application Not Starting After New Deploy
Symptoms:
- New Deployment rolled out.
- Pods in CrashLoopBackOff.
Actions:
- `oc get pods` (identify new pods).
- `oc logs --previous <pod>` (capture startup logs).
- Compare environment and image between working and broken revisions:
  - `oc rollout history deploy/myapp`
  - `oc describe deploy myapp`
- If config-related:
- Fix ConfigMap/Secret or env vars.
- Redeploy and verify.
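Put together, the triage for this scenario might look like the following sketch (names are placeholders):

```
# New pods and their states after the rollout.
oc get pods -l app=myapp

# Startup logs from the last failed attempt.
oc logs <pod-name> --previous

# Compare with the previous revision; roll back if it was healthy.
oc rollout history deploy/myapp
oc rollout undo deploy/myapp
```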
Scenario 2: Application Suddenly Slow Under Load
Symptoms:
- Increased response times.
- Some requests timing out.
- No recent deployments.
Actions:
- Check metrics for:
- CPU and memory usage of pods.
- Request rate and error rate.
- Look for:
- CPU throttling (near or at limit).
- OOMKilled restarts.
- Evaluate scaling:
- Is Horizontal Pod Autoscaling in place and working?
- Use tracing (if available) or logs to see which operations are slow.
- Adjust:
- Resource requests/limits.
- Application code / queries.
- Autoscaling thresholds.
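A few commands that cover the scaling part of this checklist (names and thresholds are hypothetical):

```
# Is an autoscaler present, and is it actually scaling?
oc get hpa
oc describe hpa myapp

# Current usage next to requests/limits (requires cluster metrics).
oc adm top pods -l app=myapp

# If no autoscaler exists yet, a starting point; tune thresholds to your app.
oc autoscale deploy/myapp --min=2 --max=10 --cpu-percent=75
```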
Scenario 3: Some Requests Fail, Others Succeed
Symptoms:
- Intermittent 5xx or 503.
- Only some users affected.
Actions:
- Check pod readiness and restarts:
  - `oc get pods`
  - `oc describe pod <pod>`
- Examine logs for specific pods with errors around the same time.
- Check if:
- Only specific pods misbehave (e.g., one pod misconfigured, others fine).
- Errors correlate with particular routes/paths.
- Drain or delete problematic pod(s) and observe:
- If problem disappears, inspect configuration differences.
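One way to isolate a misbehaving pod, as a sketch with placeholder names:

```
# Which pod is producing the errors? Compare restart counts and node placement.
oc get pods -l app=myapp -o wide

# Remove the suspect pod; the Deployment replaces it automatically.
oc delete pod <pod-name>

# Watch the replacement come up and confirm whether the intermittent errors stop.
oc get pods -l app=myapp -w
```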
Building Troubleshooting into Your Development Process
The most effective troubleshooting is prepared:
- Design applications with:
- Clear and consistent logging (levels, structure, correlation IDs).
- Health endpoints suitable for probes and manual checks.
- Configurations externalized and observable (e.g., logging config at startup).
- Include:
- Synthetic tests and dashboards targeting key endpoints.
- Alerts on critical error rates and latencies.
In OpenShift, this means:
- Leveraging the platform’s logging and monitoring but making sure your application exposes enough meaningful signals.
- Using `oc` and the console not only when something breaks, but also during development and testing, so the troubleshooting process becomes familiar and repeatable.
By combining logs, metrics, traces, and the runtime inspection tools shown here, you can methodically track down most application-level issues on OpenShift.