Learning goals for this exercise
By the end of this project component, you should be able to:
- Design a focused monitoring and troubleshooting scenario around an application you deployed earlier in the course.
- Use OpenShift’s monitoring and logging tools to collect relevant evidence.
- Perform a structured troubleshooting workflow on a real issue (or a simulated one).
- Document your findings in a clear, reproducible report suitable for sharing with a team.
This is not just “fix the problem”; it is “show how you detected it, investigated it, and confirmed the fix”, using OpenShift-native tools.
Requirements for the report
Your monitoring and troubleshooting report should cover:
- Context
- Brief description of the application and how it is deployed:
- Type of app (e.g., web API, frontend, batch worker).
- Main OpenShift objects involved (e.g., `Deployment`, `Route`, `Service`, `ConfigMap`, `Secret`, `PersistentVolumeClaim`).
- Environment details:
- Project/namespace name.
- Any relevant scaling or resource settings (e.g., requests/limits, HPA).
- Scenario definition
- One or more concrete issues you will investigate. Examples:
- Application is slow or intermittently unavailable.
- Pods are restarting frequently (`CrashLoopBackOff`).
- CPU or memory is saturating under load.
- Errors in logs related to configuration, database, or external services.
- Clearly state:
- Symptoms (what the user or operator observes).
- Suspected area (application, configuration, resource limits, networking, storage, etc.).
- Monitoring and observability setup
- Which monitoring/observability tools you used and how:
- Built-in OpenShift web console views (metrics, pod details, events).
- CLI with `oc` for status and logs.
- Any additional dashboards or queries (e.g., custom graphs, alerts).
- Any minimal configuration you added to improve observability for this exercise, for example:
- Enabling or adjusting application logging verbosity.
- Adding simple application-level health endpoints and checking them via a `Route` or `Service`.
- You do not need to build an entire monitoring stack from scratch; rely on what the cluster provides and what you configured in earlier project steps.
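As an illustration of the health-endpoint idea above: if your application exposes a simple health path, you could wire it into a readiness probe and spot-check it through the `Route`. This is only a sketch; the `/healthz` path, port 8080, and the route name are assumptions to replace with your own values.
# Attach a readiness probe to the Deployment (path and port are assumptions)
oc set probe deployment/<name> --readiness --get-url=http://:8080/healthz -n <project>
# Spot-check the same endpoint through the exposed Route
APP_HOST=$(oc get route <route-name> -n <project> -o jsonpath='{.spec.host}')
curl -k "https://${APP_HOST}/healthz"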
- Data collection steps
Describe exactly how you collected evidence for your investigation, including representative commands and console views.
At minimum, include:
- Resource and status inspection
- How you checked pod and deployment status:
oc get pods -n <project>
oc describe pod <pod-name> -n <project>
oc get deployment -n <project>
- Node or cluster status if relevant:
oc get nodes
oc get events -n <project>
- Logs
- Commands or console actions to inspect logs:
oc logs <pod-name> -c <container-name> -n <project>
oc logs <pod-name> --previous -n <project>
- If multiple replicas exist, how you ensured you saw logs from the relevant pod(s).
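When several replicas are running, one way to avoid missing the relevant pod is to select logs by label rather than by pod name. A possible approach (the `app=<name>` label is an assumption; use whatever label your Deployment actually applies):
# Tail recent logs from every pod carrying the application label
oc logs -l app=<name> -n <project> --all-containers=true --tail=100 --prefix=true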
- Metrics and performance
- Which metrics you looked at, such as:
- CPU and memory usage for pods or containers.
- Request rate or error rate if visible.
- Restart counts over time.
- Whether you inspected metrics:
- Via the web console’s Monitoring or Workloads views.
- Or via CLI if applicable (e.g., `oc adm top pods`).
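If you use the web console's metrics page (Observe → Metrics in recent OpenShift 4 releases; older versions label it Monitoring → Metrics), a few illustrative queries are shown below. The metric names assume the default cluster monitoring stack, and `<project>` is a placeholder for your namespace.
# Per-pod CPU usage over the last 5 minutes
sum(rate(container_cpu_usage_seconds_total{namespace="<project>"}[5m])) by (pod)
# Per-pod working-set memory
sum(container_memory_working_set_bytes{namespace="<project>"}) by (pod)
# Container restarts over the last hour
increase(kube_pod_container_status_restarts_total{namespace="<project>"}[1h])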
For each step, capture:
- The command or console path.
- A summary of what you observed (you may include short excerpts; no need to paste huge logs).
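To make the evidence easy to reproduce and attach to the report, you might capture a point-in-time snapshot with a handful of commands. This is only a sketch; the file names and the `app=<name>` label are assumptions.
# Capture workload state, recent events, resource usage, and logs for the report
oc get pods -n <project> -o wide > pods.txt
oc get events -n <project> --sort-by=.lastTimestamp > events.txt
oc adm top pods -n <project> > top-pods.txt
oc describe deployment/<name> -n <project> > deployment.txt
oc logs -l app=<name> -n <project> --tail=200 > app-logs.txt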
- Troubleshooting workflow
Structure this as a sequence of hypotheses and tests:
- Initial observation
- What first indicated that there was a problem?
- Example: health checks failing, requests returning 500 status, pods in `CrashLoopBackOff`.
- First hypothesis and test
- Hypothesis example: “The pod is OOMKilled due to low memory limit.”
- Evidence you collected:
- Pod `describe` output showing `OOMKilled` in the last state.
- Metrics showing memory usage hitting the limit.
- Conclusion: confirm or reject the hypothesis.
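As an illustration of testing the OOMKilled hypothesis above, the termination reason and restart count are visible directly on the pod object:
# Why did the container last terminate? "OOMKilled" supports the hypothesis
oc get pod <pod-name> -n <project> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Restart count and the "Last State" section in human-readable form
oc describe pod <pod-name> -n <project> | grep -A 5 'Last State'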
- Next hypotheses and tests
- Continue until you converge on a likely root cause; examples:
- Misconfigured environment variable (e.g., wrong database URL).
- Missing permission (e.g., failure to access a persistent volume or external service).
- Insufficient resource requests/limits causing throttling.
- Application-level bug that surfaces under certain inputs.
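For the configuration-related hypotheses in this list, a quick check is to compare what the Deployment defines with what the running container actually sees; `<keyword>` is a placeholder (e.g., part of the expected database host name), and the second command assumes the image provides a shell.
# List the environment variables defined on the Deployment
oc set env deployment/<name> --list -n <project>
# Compare with what the running container sees (requires a shell in the image)
oc rsh deployment/<name> env | grep -i <keyword>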
- Decision points
- Note moments where you changed direction because evidence did not match your expectations.
- Example: “We suspected CPU saturation, but CPU metrics were low, so we shifted focus to network errors visible in logs.”
Keep the focus on how you used OpenShift information (objects, events, logs, metrics) to drive your decisions.
- Resolution and validation
Describe the fix you applied, focusing on how it is expressed in OpenShift resources.
Common resolution actions for this project may include:
- Configuration changes
- Updating a `ConfigMap` or `Secret`.
- Changing environment variables in a `Deployment` manifest.
- Rolling out a new image with a bug fix.
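A ConfigMap-driven fix of the kind listed above might look roughly like this; the key name and value are assumptions, and depending on how the application consumes the ConfigMap, a restart is typically needed for the change to take effect.
# Correct the value stored in the ConfigMap
oc patch configmap/<name> -n <project> --type merge -p '{"data":{"<key>":"<corrected-value>"}}'
# Restart the Deployment so pods pick up the new configuration, then watch the rollout
oc rollout restart deployment/<name> -n <project>
oc rollout status deployment/<name> -n <project>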
- Resource tuning
- Modifying `resources.requests` and `resources.limits` for CPU/memory.
- Adjusting replica counts if under- or over-provisioned.
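Resource tuning can be applied directly from the CLI; the values below are illustrative only, not recommendations for your application.
# Set explicit requests and raise the limits (example values only)
oc set resources deployment/<name> -n <project> --requests=cpu=100m,memory=256Mi --limits=cpu=500m,memory=512Mi
oc rollout status deployment/<name> -n <project>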
- Networking or routing adjustments
- Updating `Route` settings (e.g., TLS termination, path, target port).
- Fixing `Service` labels or selectors so pods are correctly matched.
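A quick way to check whether a `Service` actually matches any pods is to compare its selector with the pod labels and then look at its endpoints; an empty endpoints list usually means a selector/label mismatch.
# Selector configured on the Service vs. labels carried by the pods
oc get service <service-name> -n <project> -o jsonpath='{.spec.selector}'
oc get pods -n <project> --show-labels
# If no addresses are listed here, the Service is not matching any pods
oc get endpoints <service-name> -n <project>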
For each change:
- Show the command or steps:
oc set env deployment/<name> VAR=value -n <project>
oc apply -f deployment.yaml
oc rollout status deployment/<name> -n <project>
- Explain how you validated the fix:
- Which metrics improved (e.g., reduced 5xx errors, lower restart counts).
- Which logs you checked and what changed.
- Which user-facing symptoms disappeared (e.g., application reachable again, latency reduced).
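Validation can usually be shown with the same commands used during data collection; for example (the route name and path are placeholders):
# Pods should be Running with stable restart counts and reasonable resource usage
oc get pods -n <project>
oc adm top pods -n <project>
# The application should respond successfully through its Route
APP_HOST=$(oc get route <route-name> -n <project> -o jsonpath='{.spec.host}')
curl -sk -o /dev/null -w '%{http_code}\n' "https://${APP_HOST}/"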
- Summary and lessons learned
Conclude with:
- Root cause statement: a concise sentence, e.g., “The incident was caused by …”
- Key evidence: list the top 3–5 data points that were most decisive, such as:
- A particular log line.
- A metric spike.
- A pod event or condition.
- What worked well in your troubleshooting approach
- Steps that quickly narrowed down the issue.
- Tools or views that were especially useful.
- What you would improve in the future
- Monitoring gaps you discovered (e.g., missing alerts, insufficient logs).
- Additional metrics or health checks you would implement for production readiness.
- Any automation ideas (e.g., alert rules, runbooks).
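If you want to turn one of the automation ideas into something concrete, a minimal alert-rule sketch is shown below. It assumes user-workload monitoring is enabled on your cluster, uses a hypothetical rule name, and should be treated as a starting point rather than a production-ready rule.
# alert-restarts.yaml (hypothetical): fire when containers in the project restart repeatedly
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-restart-alerts
  namespace: <project>
spec:
  groups:
  - name: app.rules
    rules:
    - alert: AppPodsRestarting
      expr: increase(kube_pod_container_status_restarts_total{namespace="<project>"}[15m]) > 3
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Pods in <project> are restarting frequently
Apply it with `oc apply -f alert-restarts.yaml` and verify it appears under the console's alerting rules for your project.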
Suggested structure and template
You can organize your report using the following template. Replace placeholders with your own content.
- Application and environment overview
- Application name and purpose:
- Namespace/project:
- Key OpenShift resources involved:
- Problem statement
- Observed symptoms:
- Impact (who/what was affected, severity):
- Time window of the incident (if applicable):
- Monitoring and data collection
- Tools and views used (console pages, `oc` commands):
- Snapshot of relevant status (`oc get` / `oc describe`):
- Key metrics examined:
- Key log snippets and events:
- Analysis and troubleshooting steps
- Hypothesis 1, test method, outcome:
- Hypothesis 2, test method, outcome:
- Additional hypotheses as needed:
- How you narrowed down to the root cause:
- Resolution
- Changes made to OpenShift resources (configs, images, routes, resource limits, etc.):
- Commands or manifests used:
- Rollout/verification steps:
- Validation
- Metrics before vs after fix (at a high level):
- Log patterns before vs after:
- Final system state (pods status, application behavior):
- Retrospective
- Root cause summary:
- Monitoring/observability improvements you would implement:
- Troubleshooting techniques you found most valuable in OpenShift:
Evaluation criteria
When this part of the final project is assessed, the focus is on process and documentation, not just “having a working app”. Your report should demonstrate:
- Use of OpenShift-native information
- You relied on logs, events, metrics, and object status from OpenShift, not only on local tools.
- Clear troubleshooting reasoning
- Your hypotheses follow logically from observed data.
- You updated your approach when evidence contradicted your assumptions.
- Reproducibility
- Commands, console paths, and steps are specific enough that someone else could follow them.
- Clarity and conciseness
- You avoid unnecessary detail while preserving key evidence and conclusions.
- Reflection
- You identify improvements to monitoring and troubleshooting practices for future deployments.
Use this exercise to practice operating your OpenShift-deployed application like a real service: monitor it, investigate issues systematically, and communicate your findings in a way that others can understand and build upon.