
Troubleshooting applications

Key Principles of Application Troubleshooting on OpenShift

Troubleshooting applications on OpenShift builds on the platform’s monitoring, logging, and observability features, but focuses specifically on identifying, isolating, and resolving issues at the application level, not on cluster-wide health.

This chapter assumes you are already familiar with the platform's monitoring, logging, and observability tooling. Here the focus is: how to use those tools and other primitives to debug misbehaving applications in a structured way.

A Structured Troubleshooting Workflow

A typical troubleshooting flow for an application running on OpenShift:

  1. Detect the symptom
    • Error in UI, failing API call, degraded performance, pod crash, etc.
  2. Check high-level status
    • Are the pods up? Is the route responding? Are there alerts firing?
  3. Narrow down the layer
    • Is it a code bug, configuration issue, dependency problem, or platform issue?
  4. Deep dive with the right tool
    • Logs, events, metrics, traces, live pod inspection, or local reproduction.
  5. Verify the fix
    • Roll out a change (config, image, scaling) and confirm via metrics and logs.

The rest of the chapter is organized around these steps and the tools you can use.

Using the Web Console for First-Level Diagnosis

For many issues, the fastest path is the web console. Focus on application-level resources: projects, workloads, pods, and routes.

Checking Workload and Pod Health

From the console, open the Topology view for the project (Developer perspective), check the status of the workload, and drill into its pods to see their phase, restart counts, and recent events.

Common conditions such as CrashLoopBackOff, ImagePullBackOff, or pods stuck in Pending point, respectively, to a crashing application process, an incorrect image reference or registry access problem, or resource requests that cannot be scheduled.

Inspecting Route and Service Reachability

If the symptom is “my app is not reachable”:

  1. In Topology, select the application.
  2. Check Routes section:
    • Confirm a route exists and has a host.
    • Use the “Open URL” link.
  3. If it fails:
    • Look at Events for any service/endpoint issues.
    • Check Pods → ensure endpoints are “Ready” (readiness probe passing).
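
The same checks can be done from the CLI; a minimal sketch (route, service, and host names are placeholders):

# Does a route exist, and which host does it expose?
oc get route

# Does the service have ready endpoints behind it?
oc get endpoints <service-name>

# Call the route from outside the cluster
curl -kv https://<route-host>/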

Application-specific interpretation: if the route exists but there are no ready endpoints, the problem is usually a failing readiness probe or a crashing pod rather than the route itself; if endpoints are ready but the URL still fails, look at the application's request handling and logs.

Using `oc` for Troubleshooting

The CLI gives you more detail and is better for systematic troubleshooting and automation.

Inspecting Pod and Container Status

Key commands are `oc get pods` (overall status and restart counts) and `oc describe pod <pod-name>` (detailed state, probe results, and events).

In `oc describe pod`, focus on the container state and last state, the restart count, and the Events section at the bottom.

Use label selectors to narrow scope to a single application, for example `-l app=<name>`.
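
For example (the label key/value is a placeholder for whatever labels your app uses):

# Only the pods belonging to one application
oc get pods -l app=<app-name>

# Watch status changes while reproducing the issue
oc get pods -l app=<app-name> -w

# Full container states, restart counts, and recent events for one pod
oc describe pod <pod-name>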

Accessing Pod Logs

Logs are often the most direct window into what is going wrong in the application.

Basic log access is `oc logs <pod-name>`; add `-c <container>` for multi-container pods and `--previous` to see the logs of the last terminated container.

For real-time issues, stream logs while you reproduce the problem with `oc logs -f <pod-name>`.

An application-oriented tip: if your logs are structured (JSON, key/value), keep that in mind when interpreting them; centralized logging will rely on consistent structure for search.
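
A few concrete invocations (all standard `oc logs` options; names are placeholders):

# Only the last 10 minutes, to avoid scrolling through old output
oc logs <pod-name> --since=10m

# The last 100 lines from a pod behind the workload, without looking up pod names
oc logs deploy/<deployment-name> --tail=100

# Follow logs live and filter for errors (assumes grep on your workstation)
oc logs -f <pod-name> | grep -i error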

Interactive Debugging in a Running Pod

Sometimes you need to inspect the runtime environment from inside the pod.

Using `oc rsh` and `oc exec`

Common checks include the effective environment (`env`), the presence and contents of mounted configuration files, free disk space on writable volumes, and whether the application answers on its own port locally.
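
For example (paths and ports are placeholders, and the tools must exist in the image):

# Open an interactive shell in the pod
oc rsh <pod-name>

# Inside the pod:
env | sort                                # effective environment variables
ls -l /path/to/config                     # are the expected config files mounted?
df -h                                     # disk space on writable volumes
curl -s http://localhost:<port>/health    # does the app answer locally?

# Or run a single command without an interactive shell
oc exec <pod-name> -- env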

If the container image does not include debug tools, you may need to use other patterns (see “Ephemeral debug containers” below).

Ephemeral Debug Containers

OpenShift supports starting a temporary debug pod, an interactive copy of an existing pod or workload, so you can inspect its runtime environment without modifying the original image.

Example:

oc debug pod/<pod-name>

This usually creates a copy of the pod with the same image, environment, and volume mounts, replaces the entrypoint with an interactive shell, and disables probes and labels so the debug pod does not receive traffic.

Typical application use-cases are inspecting configuration and environment, testing connectivity to dependencies, and debugging containers whose image lacks a shell or whose process exits before you can `oc exec` into it.
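
For example (the `--image` variant swaps in an image with more tooling; the UBI image shown here is just one possible choice):

# Start a debug copy of a specific pod
oc debug pod/<pod-name>

# Start a debug pod based on a workload's pod template
oc debug deployment/<deployment-name>

# Use a different image that contains debugging tools
oc debug pod/<pod-name> --image=registry.access.redhat.com/ubi9/ubi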

Interpreting Common Application-Level Failure Modes

The same platform symptoms can be caused by many different application-level issues. The goal is to learn what to check next.

CrashLoopBackOff and Application Crashes

The pod starts, exits shortly afterwards, and is restarted repeatedly; `oc get pods` shows a growing restart count and the status CrashLoopBackOff.

Debug steps:

  1. Check recent logs:
    • oc logs <pod-name> --previous
    • Look for:
      • Uncaught exceptions, stack traces.
      • “Cannot connect” errors to external services.
      • “File not found” for configuration or dependency files.
  2. Check configuration:
    • oc set env deploy/<deployment-name> --list
    • oc rsh or oc exec to verify config files exist.
  3. Correlate with recent changes:
    • New image version?
    • New config or secret?
    • New environment variable?
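
To see why the container last terminated, you can read the exit code and reason straight from the pod status (standard Kubernetes status fields):

# Exit code and reason of the last terminated container
oc get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

Exit code 137 with reason OOMKilled points at the memory limit; a non-zero application exit code points back at the logs.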

Typical root causes are missing or invalid configuration, a dependency that is unreachable at startup, a regression introduced by the new image, or the container being killed for exceeding its memory limit.

Readiness and Liveness Probe Failures

The pod runs but never becomes Ready, so it is kept out of the service endpoints (readiness failures), or the container is periodically restarted by the platform (liveness failures).

Debug steps:

  1. oc describe pod <pod>:
    • Look at events: Readiness probe failed: ... or Liveness probe failed: ....
  2. From inside the pod (oc rsh):
    • Run the same endpoint checked by the probe:
      • curl http://localhost:<port>/health (adjust path/port).
    • Observe response code and body.
  3. Verify that:
    • The app actually listens on the port defined in the container spec.
    • The health endpoint is correctly implemented and fast.
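
To compare what is configured against what the application actually does, a minimal sketch (assumes a Deployment with a single container):

# The probe definitions actually configured on the workload
oc get deploy <deployment-name> \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}{"\n"}{.spec.template.spec.containers[0].livenessProbe}{"\n"}'

# From inside the pod, call the same endpoint the probe uses
oc rsh <pod-name>
curl -i http://localhost:<port>/health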

Typical root causes are an application listening on a different port than the one declared in the container spec, a health endpoint that is missing, slow, or dependent on an unavailable backend, or probe timings that are too aggressive for the application's startup time.

Configuration and Secret Issues

Configuration problems often manifest as crashes at startup, missing or default settings, failed authentication against dependencies, or behavior that differs between environments.

Debug steps:

  1. Inspect environment at the workload level:
    • oc set env deploy/<name> --list
  2. Inspect ConfigMaps and Secrets:
    • oc get configmap / oc get secret
    • oc describe configmap <name>
  3. Check mounts and environment from inside the pod:
    • oc rsh <pod>
    • Confirm:
      • Files are present at the expected paths.
      • Values are what the app expects (strings vs JSON, paths, URL formats).
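
A minimal sketch of those checks (key, file, and mount names are placeholders):

# Environment variables defined on the workload
oc set env deploy/<deployment-name> --list

# Keys stored in a ConfigMap / Secret
oc describe configmap <configmap-name>
oc get secret <secret-name> -o jsonpath='{.data.<key>}' | base64 -d

# Inside the pod: is the file mounted where the app expects it?
oc rsh <pod-name>
ls -l /path/to/mount
cat /path/to/mount/<file>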

Typical root causes are ConfigMaps or Secrets mounted at a different path than the application expects, missing keys or environment variables, values in the wrong format, or pods that were never restarted after a configuration change.

Dependency and Network-Related Application Issues

Applications often depend on databases, message queues, or external APIs. Troubleshooting must distinguish “my app is broken” from “my app cannot reach its dependency.”

Debug steps:

  1. From the pod:
    • oc rsh <pod>
    • Try to reach the service:
      • curl http://service-name:port
      • nc -zv service-name port (if tooling is available).
  2. Check service and endpoints:
    • oc get svc
    • oc describe svc <service-name>
    • oc get endpoints <service-name>
      • Ensure there are ready endpoints.
  3. Look at application logs:
    • Connection timeout errors vs authentication errors vs DNS resolution errors.
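
A sketch of the connectivity checks (tool availability inside the pod depends on the image):

# From inside the application pod
oc rsh <pod-name>
getent hosts <service-name>              # does the service name resolve?
nc -zv <service-name> <port>             # is the port reachable?
curl -sv http://<service-name>:<port>/   # what does the dependency answer?

# Back on your workstation: does the service have any ready endpoints?
oc get endpoints <service-name>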

Typical root causes are a service with no ready endpoints behind it, a wrong service name or port in the application configuration, DNS or network policy restrictions, or invalid credentials for the dependency.

Using Metrics and Dashboards to Troubleshoot Performance and Reliability

For performance, latency, or resource issues, logs alone are insufficient. Use the built-in monitoring stack to identify where the bottleneck is.

Identifying Resource Constraints

From the console (Developer or Administrator perspective), open the metrics and resource-usage views for the workload and compare CPU and memory consumption against the configured requests and limits.

Typical application signs of resource issues are rising latency under load, OOMKilled restarts, and CPU usage sitting at the limit (throttling).

At the CLI:
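
A minimal sketch, assuming the cluster metrics API is available and you have view access to the project:

# Current CPU and memory consumption per pod in the current project
oc adm top pods

# Requests, limits, and last termination state (e.g. OOMKilled) for one pod
oc describe pod <pod-name>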

Correlating Metrics with Application Events

Effective troubleshooting often relies on correlating changes (new rollouts, configuration updates, traffic spikes) with the metrics they affect (error rate, latency, CPU, memory, restarts).

Use metrics to narrow down when the problem started and which pods or endpoints are affected before diving into logs.

If tracing is used, you can additionally follow individual slow or failing requests across services to see which call contributes the latency or the error.

Debugging Rollouts and Version-Specific Issues

In OpenShift, application changes are typically introduced through Deployments or DeploymentConfigs. Many problems occur only after a new version is rolled out.

Comparing Old and New Revisions

Key steps:

  1. List rollouts:
    • For Deployments:
      • oc rollout history deploy/<name>
  2. Inspect a specific revision:
    • oc rollout history deploy/<name> --revision=<n>
  3. Compare:
    • Image version (tag, digest).
    • Environment variables.
    • ConfigMap/Secret versions referenced.
    • Resource limits.

If the previous version worked, rolling back is usually the fastest way to restore service while you investigate what changed; see the sketch below.
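
A sketch of the commands involved (revision numbers are examples):

# List known revisions of the deployment
oc rollout history deploy/<deployment-name>

# Show the pod template used by a specific revision
oc rollout history deploy/<deployment-name> --revision=3

# Roll back to the previous (or a specific) revision
oc rollout undo deploy/<deployment-name>
oc rollout undo deploy/<deployment-name> --to-revision=2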

Canary and Blue-Green Considerations

In more advanced setups (multiple versions running concurrently), make sure logs and metrics carry a version label, so you can attribute errors and latency to the right revision and shift traffic away from a misbehaving one.

Application-Centric Use of Centralized Logging

Centralized logging (discussed elsewhere) becomes more powerful with good application log practices. From a troubleshooting point of view, focus on filtering by namespace, workload, and a time window around the incident, and on isolating error-level entries for the affected application.

Example queries in a logging UI might include all error-level lines for one deployment in the minutes around the incident, or all lines sharing a specific request ID.

At the application level, emitting consistent structured fields (request ID, endpoint, status code) is what makes such queries precise.

Practical Debugging Scenarios

To consolidate the concepts, here are representative patterns and targeted actions.

Scenario 1: Application Not Starting After New Deploy

After a new rollout, the new pods crash-loop or never become Ready, the rollout does not complete, or the route starts returning errors.

Actions:

  1. oc get pods (identify new pods).
  2. oc logs --previous <pod> (capture startup logs).
  3. Compare environment and image between working and broken revisions:
    • oc rollout history deploy/myapp
    • oc describe deploy myapp
  4. If config-related:
    • Fix ConfigMap/Secret or env vars.
    • Redeploy and verify.

Scenario 2: Application Suddenly Slow Under Load

Response times increase and timeouts appear under load, while the application behaves normally at low traffic.

Actions:

  1. Check metrics for:
    • CPU and memory usage of pods.
    • Request rate and error rate.
  2. Look for:
    • CPU throttling (near or at limit).
    • OOMKilled restarts.
  3. Evaluate scaling:
    • Is Horizontal Pod Autoscaling in place and working? (See the sketch after this list.)
  4. Use tracing (if available) or logs to see which operations are slow.
  5. Adjust:
    • Resource requests/limits.
    • Application code / queries.
    • Autoscaling thresholds.
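
To check the autoscaler from step 3, for example:

# Is an HPA defined, and what are its current/target metrics and replica counts?
oc get hpa

# Targets, current utilization, and recent scaling events
oc describe hpa <hpa-name>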

Scenario 3: Some Requests Fail, Others Succeed

A fraction of requests fail with errors or timeouts while the rest succeed, often intermittently.

Actions:

  1. Check pod readiness and restarts:
    • oc get pods
    • oc describe pod <pod>
  2. Examine logs for specific pods with errors around the same time.
  3. Check if:
    • Only specific pods misbehave (e.g., one pod misconfigured, others fine).
    • Errors correlate with particular routes/paths.
  4. Delete the problematic pod(s) (the controller recreates them) and observe:
    • If problem disappears, inspect configuration differences.
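
A sketch of isolating a single misbehaving pod (names are placeholders):

# Which pods back the service, and where do they run?
oc get pods -l app=<app-name> -o wide

# Tail logs from one suspect pod only
oc logs -f <suspect-pod-name>

# Remove the suspect pod; the controller recreates a fresh replacement
oc delete pod <suspect-pod-name>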

Building Troubleshooting into Your Development Process

The most effective troubleshooting is prepared for before an incident happens.

In OpenShift, this means building observability into your workloads: structured logs, meaningful health endpoints and probes, resource requests and limits that reflect real usage, and metrics or traces for the critical paths.

By combining logs, metrics, traces, and the runtime inspection tools shown here, you can methodically track down most application-level issues on OpenShift.
