Learning goals for this exercise
By the end of this project component, you should be able to:
- Design a focused monitoring and troubleshooting scenario around an application you deployed earlier in the course.
- Use OpenShift’s monitoring and logging tools to collect relevant evidence.
- Perform a structured troubleshooting workflow on a real issue (or a simulated one).
- Document your findings in a clear, reproducible report suitable for sharing with a team.
This is not just “fix the problem”; it is “show how you detected it, investigated it, and confirmed the fix”, using OpenShift-native tools.
Requirements for the report
Your monitoring and troubleshooting report should cover:
- Context
- Brief description of the application and how it is deployed:
- Type of app (e.g., web API, frontend, batch worker).
- Main OpenShift objects involved (e.g., `Deployment`, `Route`, `Service`, `ConfigMap`, `Secret`, `PersistentVolumeClaim`).
- Environment details:
- Project/namespace name.
- Any relevant scaling or resource settings (e.g., requests/limits, HPA).
- Scenario definition
- One or more concrete issues you will investigate. Examples:
- Application is slow or intermittently unavailable.
- Pods are restarting frequently (`CrashLoopBackOff`).
- CPU or memory is saturating under load.
- Errors in logs related to configuration, database, or external services.
- Clearly state:
- Symptoms (what the user or operator observes).
- Suspected area (application, configuration, resource limits, networking, storage, etc.).
- Monitoring and observability setup
- Which monitoring/observability tools you used and how:
- Built-in OpenShift web console views (metrics, pod details, events).
- CLI with `oc` for status and logs.
- Any additional dashboards or queries (e.g., custom graphs, alerts).
- Any minimal configuration you added to improve observability for this exercise, for example:
- Enabling or adjusting application logging verbosity.
- Adding simple application-level health endpoints and checking them via a `Route` or `Service`.
- You do not need to build an entire monitoring stack from scratch; rely on what the cluster provides and what you configured in earlier project steps.
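As an illustration of the health-endpoint idea above: if your application exposes a simple health path, you could wire it into a readiness probe and spot-check it through the `Route`. This is only a sketch; the `/healthz` path, port 8080, and the route name are assumptions to replace with your own values.
# Attach a readiness probe to the Deployment (path and port are assumptions)
oc set probe deployment/<name> --readiness --get-url=http://:8080/healthz -n <project>
# Spot-check the same endpoint through the exposed Route
APP_HOST=$(oc get route <route-name> -n <project> -o jsonpath='{.spec.host}')
curl -k "https://${APP_HOST}/healthz"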
- Data collection steps
Describe exactly how you collected evidence for your investigation, including representative commands and console views.
At minimum, include:
- Resource and status inspection
- How you checked pod and deployment status:
oc get pods -n <project>
oc describe pod <pod-name> -n <project>
oc get deployment -n <project>
- Node or cluster status if relevant:
oc get nodes
oc get events -n <project>
- Logs
- Commands or console actions to inspect logs:
oc logs <pod-name> -c <container-name> -n <project>
oc logs <pod-name> --previous -n <project>
- If multiple replicas exist, how you ensured you saw logs from the relevant pod(s).
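When several replicas are running, one way to avoid missing the relevant pod is to select logs by label rather than by pod name. A possible approach (the `app=<name>` label is an assumption; use whatever label your Deployment actually applies):
# Tail recent logs from every pod carrying the application label
oc logs -l app=<name> -n <project> --all-containers=true --tail=100 --prefix=true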
- Metrics and performance
- Which metrics you looked at, such as:
- CPU and memory usage for pods or containers.
- Request rate or error rate if visible.
- Restart counts over time.
- Whether you inspected metrics:
- Via the web console’s Monitoring or Workloads views.
- Or via CLI if applicable (e.g., `oc adm top pods`).
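If you use the web console's metrics page (Observe → Metrics in recent OpenShift 4 releases; older versions label it Monitoring → Metrics), a few illustrative queries are shown below. The metric names assume the default cluster monitoring stack, and `<project>` is a placeholder for your namespace.
# Per-pod CPU usage over the last 5 minutes
sum(rate(container_cpu_usage_seconds_total{namespace="<project>"}[5m])) by (pod)
# Per-pod working-set memory
sum(container_memory_working_set_bytes{namespace="<project>"}) by (pod)
# Container restarts over the last hour
increase(kube_pod_container_status_restarts_total{namespace="<project>"}[1h])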
For each step, capture:
- The command or console path.
- A summary of what you observed (you may include short excerpts; no need to paste huge logs).
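To make the evidence easy to reproduce and attach to the report, you might capture a point-in-time snapshot with a handful of commands. This is only a sketch; the file names and the `app=<name>` label are assumptions.
# Capture workload state, recent events, resource usage, and logs for the report
oc get pods -n <project> -o wide > pods.txt
oc get events -n <project> --sort-by=.lastTimestamp > events.txt
oc adm top pods -n <project> > top-pods.txt
oc describe deployment/<name> -n <project> > deployment.txt
oc logs -l app=<name> -n <project> --tail=200 > app-logs.txt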
- Troubleshooting workflow
Structure this as a sequence of hypotheses and tests:
- Initial observation
- What first indicated that there was a problem?
- Example: health checks failing, requests returning 500 status, pods in `CrashLoopBackOff`.
- First hypothesis and test
- Hypothesis example: “The pod is OOMKilled due to low memory limit.”
- Evidence you collected:
- Pod `describe` output showing `OOMKilled` in the last state.
- Metrics showing memory usage hitting the limit.
- Conclusion: confirm or reject the hypothesis.
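As an illustration of testing the OOMKilled hypothesis above, the termination reason and restart count are visible directly on the pod object:
# Why did the container last terminate? "OOMKilled" supports the hypothesis
oc get pod <pod-name> -n <project> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Restart count and the "Last State" section in human-readable form
oc describe pod <pod-name> -n <project> | grep -A 5 'Last State'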
- Next hypotheses and tests
- Continue until you converge on a likely root cause; examples:
- Misconfigured environment variable (e.g., wrong database URL).
- Missing permission (e.g., failure to access a persistent volume or external service).
- Insufficient resource requests/limits causing throttling.
- Application-level bug that surfaces under certain inputs.
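For the configuration-related hypotheses in this list, a quick check is to compare what the Deployment defines with what the running container actually sees; `<keyword>` is a placeholder (e.g., part of the expected database host name), and the second command assumes the image provides a shell.
# List the environment variables defined on the Deployment
oc set env deployment/<name> --list -n <project>
# Compare with what the running container sees (requires a shell in the image)
oc rsh deployment/<name> env | grep -i <keyword>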
- Decision points
- Note moments where you changed direction because evidence did not match your expectations.
- Example: “We suspected CPU saturation, but CPU metrics were low, so we shifted focus to network errors visible in logs.”
Keep the focus on how you used OpenShift information (objects, events, logs, metrics) to drive your decisions.
- Resolution and validation
Describe the fix you applied, focusing on how it is expressed in OpenShift resources.
Common resolution actions for this project may include:
- Configuration changes
- Updating a `ConfigMap` or `Secret`.
- Changing environment variables in a `Deployment` manifest.
- Rolling out a new image with a bug fix.
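A ConfigMap-driven fix of the kind listed above might look roughly like this; the key name and value are assumptions, and depending on how the application consumes the ConfigMap, a restart is typically needed for the change to take effect.
# Correct the value stored in the ConfigMap
oc patch configmap/<name> -n <project> --type merge -p '{"data":{"<key>":"<corrected-value>"}}'
# Restart the Deployment so pods pick up the new configuration, then watch the rollout
oc rollout restart deployment/<name> -n <project>
oc rollout status deployment/<name> -n <project>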
- Resource tuning
- Modifying `resources.requests` and `resources.limits` for CPU/memory.
- Adjusting replica counts if under- or over-provisioned.
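Resource tuning can be applied directly from the CLI; the values below are illustrative only, not recommendations for your application.
# Set explicit requests and raise the limits (example values only)
oc set resources deployment/<name> -n <project> --requests=cpu=100m,memory=256Mi --limits=cpu=500m,memory=512Mi
oc rollout status deployment/<name> -n <project>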
- Networking or routing adjustments
- Updating `Route` settings (e.g., TLS termination, path, target port).
- Fixing `Service` labels or selectors so pods are correctly matched.
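A quick way to check whether a `Service` actually matches any pods is to compare its selector with the pod labels and then look at its endpoints; an empty endpoints list usually means a selector/label mismatch.
# Selector configured on the Service vs. labels carried by the pods
oc get service <service-name> -n <project> -o jsonpath='{.spec.selector}'
oc get pods -n <project> --show-labels
# If no addresses are listed here, the Service is not matching any pods
oc get endpoints <service-name> -n <project>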
For each change:
- Show the command or steps:
oc set env deployment/<name> VAR=value -n <project>
oc apply -f deployment.yaml
oc rollout status deployment/<name> -n <project>
- Explain how you validated the fix:
- Which metrics improved (e.g., reduced 5xx errors, lower restart counts).
- Which logs you checked and what changed.
- Which user-facing symptoms disappeared (e.g., application reachable again, latency reduced).
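Validation can usually be shown with the same commands used during data collection; for example (the route name and path are placeholders):
# Pods should be Running with stable restart counts and reasonable resource usage
oc get pods -n <project>
oc adm top pods -n <project>
# The application should respond successfully through its Route
APP_HOST=$(oc get route <route-name> -n <project> -o jsonpath='{.spec.host}')
curl -sk -o /dev/null -w '%{http_code}\n' "https://${APP_HOST}/"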
- Summary and lessons learned
Conclude with:
- Root cause statement: a concise sentence, e.g., “The incident was caused by …”
- Key evidence: list the top 3–5 data points that were most decisive, such as:
- A particular log line.
- A metric spike.
- A pod event or condition.
- What worked well in your troubleshooting approach
- Steps that quickly narrowed down the issue.
- Tools or views that were especially useful.
- What you would improve in the future
- Monitoring gaps you discovered (e.g., missing alerts, insufficient logs).
- Additional metrics or health checks you would implement for production readiness.
- Any automation ideas (e.g., alert rules, runbooks).
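If you want to turn one of the automation ideas into something concrete, a minimal alert-rule sketch is shown below. It assumes user-workload monitoring is enabled on your cluster, uses a hypothetical rule name, and should be treated as a starting point rather than a production-ready rule.
# alert-restarts.yaml (hypothetical): fire when containers in the project restart repeatedly
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-restart-alerts
  namespace: <project>
spec:
  groups:
  - name: app.rules
    rules:
    - alert: AppPodsRestarting
      expr: increase(kube_pod_container_status_restarts_total{namespace="<project>"}[15m]) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Pods in <project> are restarting frequently
Apply it with `oc apply -f alert-restarts.yaml` and verify it appears under the console's alerting rules for your project.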
Suggested structure and template
You can organize your report using the following template. Replace placeholders with your own content.
- Application and environment overview
- Application name and purpose:
- Namespace/project:
- Key OpenShift resources involved:
- Problem statement
- Observed symptoms:
- Impact (who/what was affected, severity):
- Time window of the incident (if applicable):
- Monitoring and data collection
- Tools and views used (console pages, `oc` commands):
- Snapshot of relevant status (`oc get` / `oc describe`):
- Key metrics examined:
- Key log snippets and events:
- Analysis and troubleshooting steps
- Hypothesis 1, test method, outcome:
- Hypothesis 2, test method, outcome:
- Additional hypotheses as needed:
- How you narrowed down to the root cause:
- Resolution
- Changes made to OpenShift resources (configs, images, routes, resource limits, etc.):
- Commands or manifests used:
- Rollout/verification steps:
- Validation
- Metrics before vs after fix (at a high level):
- Log patterns before vs after:
- Final system state (pods status, application behavior):
- Retrospective
- Root cause summary:
- Monitoring/observability improvements you would implement:
- Troubleshooting techniques you found most valuable in OpenShift:
Evaluation criteria
When this part of the final project is assessed, the focus is on process and documentation, not just “having a working app”. Your report should demonstrate:
- Use of OpenShift-native information
- You relied on logs, events, metrics, and object status from OpenShift, not only on local tools.
- Clear troubleshooting reasoning
- Your hypotheses follow logically from observed data.
- You updated your approach when evidence contradicted your assumptions.
- Reproducibility
- Commands, console paths, and steps are specific enough that someone else could follow them.
- Clarity and conciseness
- You avoid unnecessary detail while preserving key evidence and conclusions.
- Reflection
- You identify improvements to monitoring and troubleshooting practices for future deployments.
Use this exercise to practice operating your OpenShift-deployed application like a real service: monitor it, investigate issues systematically, and communicate your findings in a way that others can understand and build upon.