Kahibaro

Monitoring and troubleshooting report

Learning goals for this exercise

By the end of this project component, you should be able to monitor an application deployed on OpenShift, investigate problems systematically, and document both the fix and its validation.

This is not just “fix the problem”; it is “show how you detected it, investigated it, and confirmed the fix”, using OpenShift-native tools.

Requirements for the report

Your monitoring and troubleshooting report should cover:

  1. Context
    • Brief description of the application and how it is deployed:
      • Type of app (e.g., web API, frontend, batch worker).
      • Main OpenShift objects involved (e.g., Deployment, Route, Service, ConfigMap, Secret, PersistentVolumeClaim).
    • Environment details:
      • Project/namespace name.
      • Any relevant scaling or resource settings (e.g., requests/limits, HPA).
  2. Scenario definition
    • One or more concrete issues you will investigate. Examples:
      • Application is slow or intermittently unavailable.
      • Pods are restarting frequently (CrashLoopBackOff).
      • CPU or memory is saturating under load.
      • Errors in logs related to configuration, database, or external services.
    • Clearly state:
      • Symptoms (what the user or operator observes).
      • Suspected area (application, configuration, resource limits, networking, storage, etc.).
  3. Monitoring and observability setup
    • Which monitoring/observability tools you used and how:
      • Built-in OpenShift web console views (metrics, pod details, events).
      • CLI with oc for status and logs.
      • Any additional dashboards or queries (e.g., custom graphs, alerts).
    • Any minimal configuration you added to improve observability for this exercise, for example:
      • Enabling or adjusting application logging verbosity.
      • Adding simple application-level health endpoints and checking them via Route or Service.
    • You do not need to build an entire monitoring stack from scratch; rely on what the cluster provides and what you configured in earlier project steps.
  4. Data collection steps

Describe exactly how you collected evidence for your investigation, including representative commands and console views.

At minimum, include:

       oc get pods -n <project>
       oc describe pod <pod-name> -n <project>
       oc get deployment -n <project>
       oc get nodes
       oc get events -n <project>
       oc logs <pod-name> -c <container-name> -n <project>
       oc logs <pod-name> --previous -n <project>
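The listings above produce a lot of output, and a quick triage filter helps surface problem pods for the report. A minimal sketch, where a saved `oc get pods` sample stands in for live cluster output (the pod names and the `/tmp` path are illustrative assumptions):

```shell
#!/bin/sh
# Triage saved `oc get pods` output: flag pods that are not Running
# or have restarted. Against a live cluster you would pipe output in:
#   oc get pods -n <project> | awk 'NR > 1 && ($3 != "Running" || $4 > 0)'
# The sample below is illustrative evidence captured during an incident.
cat > /tmp/pods.txt <<'EOF'
NAME                  READY   STATUS             RESTARTS   AGE
web-7f9c6d-abcde      1/1     Running            0          2d
worker-5b8d7f-fghij   0/1     CrashLoopBackOff   12         3h
EOF

# Skip the header row; print name, status, and restart count for suspect pods.
awk 'NR > 1 && ($3 != "Running" || $4 > 0) {print $1, $3, $4}' /tmp/pods.txt
```

Including both the raw listing and the filtered view in the report shows how the evidence was narrowed down.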

For each step, capture the command or console view you used, the relevant output (or a screenshot), and what that output told you about the state of the system.

  5. Troubleshooting workflow

Structure this as a sequence of hypotheses and tests:

  1. Initial observation
    • What first indicated that there was a problem?
    • Example: health checks failing, requests returning 500 status, pods in CrashLoopBackOff.
  2. First hypothesis and test
    • Hypothesis example: “The pod is OOMKilled due to low memory limit.”
    • Evidence you collected:
      • Pod describe showing OOMKilled in last state.
      • Metrics showing memory usage hitting the limit.
    • Conclusion: confirm or reject the hypothesis.
  3. Next hypotheses and tests
    • Continue until you converge on a likely root cause; examples:
      • Misconfigured environment variable (e.g., wrong database URL).
      • Missing permission (e.g., failure to access a persistent volume or external service).
      • Insufficient resource requests/limits causing throttling.
      • Application-level bug that surfaces under certain inputs.
  4. Decision points
    • Note moments where you changed direction because evidence did not match your expectations.
    • Example: “We suspected CPU saturation, but CPU metrics were low, so we shifted focus to network errors visible in logs.”

Keep focus on how you used OpenShift information (objects, events, logs, metrics) to drive your decisions.
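As a concrete example of one hypothesis test, the OOMKilled suspicion above can be checked directly against pod state. A hedged sketch: the excerpt below stands in for live `oc describe pod` output, and on a real cluster you could instead query `oc get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'`.

```shell
#!/bin/sh
# Confirm or reject the "pod is OOMKilled" hypothesis from a saved
# `oc describe pod` excerpt (illustrative; capture yours with
#   oc describe pod <pod-name> -n <project> > /tmp/describe.txt).
cat > /tmp/describe.txt <<'EOF'
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Restart Count:  12
EOF

# Exit code 137 = 128 + SIGKILL(9), consistent with the kernel OOM killer.
if grep -q 'Reason:.*OOMKilled' /tmp/describe.txt; then
  echo "hypothesis confirmed: container was OOMKilled"
else
  echo "hypothesis rejected: no OOMKilled in last state"
fi
```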

  6. Resolution and validation

Describe the fix you applied, focusing on how it is expressed in OpenShift resources.

Common resolution actions for this project may include:

  • Correcting environment variables, ConfigMaps, or Secrets.
  • Adjusting resource requests/limits on the Deployment.
  • Updating the container image to a fixed version.
  • Repairing Route, Service, or storage configuration.

For each change, record the commands or manifests you used and how you verified the rollout, for example:

     oc set env deployment/<name> VAR=value -n <project>
     oc apply -f deployment.yaml
     oc rollout status deployment/<name> -n <project>
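Rollout verification can also be scripted, since `oc rollout status` exits non-zero when a rollout fails. A minimal sketch; here a captured status line stands in for live output (the deployment name "web" is an illustrative assumption):

```shell
#!/bin/sh
# Validate a fix by confirming the new rollout completed.
# On a live cluster the exit code alone is enough, e.g.:
#   oc rollout status deployment/<name> -n <project> --timeout=120s \
#     && echo "validation passed"
# Here a saved status line stands in for live output.
STATUS='deployment "web" successfully rolled out'
case "$STATUS" in
  *"successfully rolled out"*) echo "validation passed" ;;
  *)                           echo "validation failed: $STATUS" ;;
esac
```

Pairing this check with a before/after look at logs and metrics closes the loop between the fix and the original symptoms.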
  7. Summary and lessons learned

Conclude with a brief root cause summary, the most important lessons learned, and any monitoring or observability improvements you would make going forward.

Suggested structure and template

You can organize your report using the following template. Replace placeholders with your own content.

  1. Application and environment overview
    • Application name and purpose:
    • Namespace/project:
    • Key OpenShift resources involved:
  2. Problem statement
    • Observed symptoms:
    • Impact (who/what was affected, severity):
    • Time window of the incident (if applicable):
  3. Monitoring and data collection
    • Tools and views used (console pages, oc commands):
    • Snapshot of relevant status (oc get / oc describe):
    • Key metrics examined:
    • Key log snippets and events:
  4. Analysis and troubleshooting steps
    • Hypothesis 1, test method, outcome:
    • Hypothesis 2, test method, outcome:
    • Additional hypotheses as needed:
    • How you narrowed down to the root cause:
  5. Resolution
    • Changes made to OpenShift resources (configs, images, routes, resource limits, etc.):
    • Commands or manifests used:
    • Rollout/verification steps:
  6. Validation
    • Metrics before vs after fix (at a high level):
    • Log patterns before vs after:
    • Final system state (pods status, application behavior):
  7. Retrospective
    • Root cause summary:
    • Monitoring/observability improvements you would implement:
    • Troubleshooting techniques you found most valuable in OpenShift:

Evaluation criteria

When this part of the final project is assessed, the focus is on process and documentation, not just “having a working app”. Your report should demonstrate a systematic, hypothesis-driven investigation; effective use of OpenShift objects, events, logs, and metrics as evidence; and clear communication of the problem, the fix, and its validation.

Use this exercise to practice operating your OpenShift-deployed application like a real service: monitor it, investigate issues systematically, and communicate your findings in a way that others can understand and build upon.
