Why Monitoring, Logging, and Observability Matter on OpenShift
Modern applications on OpenShift are distributed, dynamic, and often composed of many microservices. Pods are created and destroyed frequently, workloads scale automatically, and failures can be transient or hidden. In this environment, traditional “check a single server” style monitoring is not enough.
Monitoring, logging, and observability together enable you to:
- Understand the current health of the platform and applications
- Detect and react to failures and performance regressions
- Analyze trends and capacity usage over time
- Troubleshoot complex, cross-service issues
- Meet SLOs/SLAs and compliance requirements
On OpenShift, these capabilities are provided by an integrated stack that builds on Kubernetes concepts but adds opinionated defaults, security controls, and multi-tenancy features.
Core Concepts and Terminology
Before looking at the specific OpenShift components, it helps to differentiate:
- Monitoring: Continuous collection and visualization of metrics (numbers over time) to track health, performance, and resource usage. Typically used for dashboards and alerting.
- Logging: Collection, storage, and querying of event logs (structured or unstructured text) generated by applications and platform components.
- Tracing: Recording and visualizing the path of individual requests across multiple services, with timing information.
- Observability: The overall ability to understand internal states of the system from externally visible outputs (metrics, logs, traces, events).
OpenShift provides built-in monitoring and integrates with logging and tracing stacks to achieve practical observability for both platform and user workloads.
Layers of Observability in OpenShift
Observability on OpenShift can be thought of in three main layers:
- Platform (cluster) layer
- Metrics: API server, controllers, etcd, kubelet, SDN, ingress, storage, Operators, nodes.
- Logs: Control plane logs, infrastructure workloads, ingress, Operators.
- Focus: Cluster health, capacity, upgrade readiness, SLA of the platform itself.
- Application (user workload) layer
- Metrics: Application performance (e.g., request rate, error rate, latency), business metrics, custom metrics.
- Logs: Application stdout/stderr, structured logs, audit trails.
- Focus: Application correctness, performance, debugging logical errors.
- Request (end-to-end) layer
- Traces: End-to-end view of a single transaction through multiple services.
- Focus: Latency hotspots, dependency analysis, bottlenecks in distributed calls.
Different OpenShift components target these different layers; cluster administrators and application developers typically interact with them at different levels of detail and with different permissions.
Observability Responsibilities and Roles
On OpenShift, responsibilities are usually split between:
- Cluster administrators
- Ensure the platform monitoring and logging are deployed and healthy.
- Configure retention, scaling, and backends for logs and metrics.
- Provide access patterns (dashboards, APIs) and guardrails.
- Monitor infrastructure components and Operators.
- Application teams / developers
- Expose useful application-level metrics.
- Emit structured, meaningful logs to stdout/stderr.
- Add tracing instrumentation where relevant.
- Use the provided tools to monitor and troubleshoot their workloads.
This separation shapes how OpenShift structures its “platform” vs “user” monitoring and logging.
Data Sources: What Gets Observed
Across monitoring, logging, and tracing, the main data sources in OpenShift include:
- Kubernetes and OpenShift components
- API server, controller manager, scheduler
- etcd
- Kubelet and node-level exporters
- SDN / network plugins and ingress controllers
- Storage drivers, CSI sidecars
- OpenShift-specific controllers (e.g., cluster version operator, machine API)
- Workload resources
- Pods, Deployments, StatefulSets, Jobs, CronJobs
- Services, Routes, Ingress
- PersistentVolumes, PersistentVolumeClaims
- Custom resources managed by Operators
- Applications themselves
- Business and domain metrics
- Application logs
- Traces and spans for distributed calls
In practice, each of these either exposes metrics endpoints, writes logs to stdout/stderr, or offers trace hooks via an SDK or sidecar.
Observability Data Types
Metrics
Metrics are time-series data: numeric values that change over time.
Common categories:
- Resource metrics
- CPU usage, memory usage, filesystem usage
- Request vs limit utilization
- Node resource utilization
- Platform metrics
- API server request latency and error counts
- etcd operation latency and failures
- Ingress request counts, 4xx/5xx rates
- Controller queue lengths, reconciliation errors
- Application metrics
- Requests per second (RPS/QPS), success/error rates
- Latency histograms or quantiles
- Queue lengths, worker pool utilization
- Business KPIs (e.g., orders per minute)
Metrics are typically pulled from HTTP endpoints (e.g., /metrics in Prometheus format) or collected by node-level agents.
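As an illustration of that pull model, the sketch below uses the Python prometheus_client library to expose a request counter and a latency histogram on a /metrics endpoint. The metric names, label names, and port are illustrative choices, not anything OpenShift mandates.

```python
# Minimal sketch: expose Prometheus-format metrics over HTTP.
# Requires the prometheus_client package (pip install prometheus-client).
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric and label names; keep label values low-cardinality.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["method", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
)

def handle_request() -> None:
    """Simulate handling one request and record metrics for it."""
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.1))   # pretend work
    status = "200" if random.random() > 0.05 else "500"
    REQUESTS.labels(method="GET", status=status).inc()

if __name__ == "__main__":
    # Serves text-format metrics at http://<pod-ip>:8080/metrics,
    # which a Prometheus-compatible scraper can then pull.
    start_http_server(8080)
    while True:
        handle_request()
```

On OpenShift, such an endpoint is typically scraped by the monitoring stack once a scrape target (for example, a ServiceMonitor for user workloads) is configured; those details are covered later in the course.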
Logs
Logs are event records, usually text-based, sometimes structured as JSON.
Key log streams in OpenShift environments:
- Application logs
- Written to stdout/stderr from containers
- Often structured with fields such as timestamp, level, request ID
- Infrastructure and platform logs
- Control plane components
- Ingress controllers and proxies
- Storage drivers and Operators
- Node system logs (journal, kernel messages)
- Security and audit logs
- Kubernetes API audit logs
- Authentication/authorization logs
- Compliance and policy enforcement logs (depending on configuration)
Logs are essential when metrics indicate “something is wrong” but you need context to understand exactly what happened.
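To make the structured-log idea above concrete, here is a minimal Python sketch that writes JSON log lines with timestamp, level, and request ID fields to stdout, where a cluster-level collector can pick them up. The field names are an illustrative convention, not a required schema.

```python
# Minimal sketch: structured JSON logging to stdout so a cluster-level
# log collector can pick up the lines and index the fields.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Illustrative correlation field, supplied per request via `extra`.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Extra fields become attributes on the log record and end up in the JSON line.
logger.info("order accepted", extra={"request_id": "req-12345"})
```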
Traces
Traces represent individual requests as they traverse multiple services.
Fundamental elements:
- Trace: One end-to-end request (e.g., a web request).
- Span: A timed operation within that request (e.g., “call database”, “call payment service”).
- Context propagation: Passing trace IDs between services so spans can be linked.
In OpenShift environments with microservices and service meshes, tracing helps identify which service is responsible for high latency or errors in complex call chains.
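A minimal Python sketch of these elements using the OpenTelemetry SDK is shown below: two nested spans form a small trace, and inject() copies the trace context into outgoing headers so a downstream service can continue the same trace. It exports to the console purely for illustration; a real deployment would export to a tracing backend, and the service name is a placeholder.

```python
# Minimal sketch: create spans and propagate trace context with OpenTelemetry.
# Requires opentelemetry-api and opentelemetry-sdk.
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; a real service would export to a collector.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_checkout() -> None:
    # One span per logical operation; nested spans form the trace tree.
    with tracer.start_as_current_span("handle-checkout"):
        with tracer.start_as_current_span("call-payment-service"):
            headers: dict[str, str] = {}
            # Injects W3C traceparent headers so the downstream service
            # can link its spans to this trace.
            inject(headers)
            print("outgoing headers:", headers)

if __name__ == "__main__":
    handle_checkout()
```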
Observability Patterns and Best Practices on OpenShift
Unified but Segregated: Platform vs User Observability
OpenShift emphasizes:
- Platform monitoring/logging
- Usually managed and operated by the cluster admin.
- Often restricted in access for security and multi-tenancy.
- Focused on cluster-wide and infrastructure metrics/logs.
- User/workload monitoring/logging
- Accessible by project/namespace owners.
- Tailored to application needs.
- Often uses the same tools/stack, but with separate data flows, retention, and permissions.
This pattern allows strong separation of concerns while giving app teams enough observability without exposing sensitive platform internals.
Labels and Metadata
Labels and annotations are crucial for making metrics, logs, and traces useful:
- Labels on Kubernetes/OpenShift resources (e.g., app, component, version, environment) help:
- Filter metrics by application, environment, or component
- Group dashboard panels
- Correlate logs with deployments or versions
- Consistent label schemes across:
- Deployments/Pods
- Services and Routes
- Metrics (e.g., Prometheus labels)
- Logs (e.g., structured fields)
Using the same scheme across all of these makes it much easier to correlate the different data types (a short sketch of sharing one label set in code follows below).
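One way to keep such a scheme consistent inside application code is to define the label set once and reuse it for both metric labels and structured log fields, as in the sketch below. The environment variable names and label values are assumptions for illustration; on OpenShift the values would typically mirror the labels set on the Deployment.

```python
# Minimal sketch: one shared label set reused for metrics and log fields,
# mirroring the labels on the Deployment/Pod (app, component, version, environment).
import json
import os

from prometheus_client import Counter

# Typically injected via environment variables or the Downward API;
# the variable names here are an assumption for illustration.
LABELS = {
    "app": os.getenv("APP_NAME", "frontend"),
    "component": os.getenv("APP_COMPONENT", "web"),
    "version": os.getenv("APP_VERSION", "1.4.2"),
    "environment": os.getenv("APP_ENVIRONMENT", "production"),
}

ORDERS = Counter(
    "orders_processed_total",
    "Orders processed",
    list(LABELS.keys()),
)

def record_order() -> None:
    # The same identifiers appear in the metric ...
    ORDERS.labels(**LABELS).inc()
    # ... and in the structured log line, so both can be filtered the same way.
    print(json.dumps({"message": "order processed", **LABELS}))

if __name__ == "__main__":
    record_order()
```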
Correlation Across Metrics, Logs, and Traces
Effective observability in OpenShift hinges on correlating different types of signals:
- Use common identifiers
- Request ID or correlation ID added to:
- HTTP headers
- Log entries
- Trace context
- Pod or container IDs included in log records and metrics labels
- Use consistent naming
- Service names that match:
- Kubernetes Service and Deployment names
- Trace service names
- Dashboard panels
This allows workflows like:
- Start from an alert (metrics) → identify the affected pods and services → jump to logs → if needed, inspect traces for specific requests (a sketch of the correlation-ID piece follows below).
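A minimal sketch of the correlation-ID part of that workflow: the same request ID, taken from an incoming header or generated, is attached to the current span and to every log line, so a trace and its logs can be matched up later. The header name X-Request-ID is a common convention rather than an OpenShift requirement, and the sketch assumes a tracer is already configured as in the earlier tracing example.

```python
# Minimal sketch: carry one correlation ID through logs and trace attributes.
# Assumes a TracerProvider is already configured (see the tracing sketch above).
import json
import uuid

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle(headers: dict[str, str]) -> None:
    # Reuse the caller's ID if present, otherwise create one.
    request_id = headers.get("X-Request-ID", str(uuid.uuid4()))

    with tracer.start_as_current_span("handle-request") as span:
        # The same ID goes onto the span ...
        span.set_attribute("request.id", request_id)
        # ... and into every structured log line for this request.
        print(json.dumps({"level": "INFO",
                          "message": "request handled",
                          "request_id": request_id}))
        # Forward the ID to downstream calls so their logs line up too.
        headers["X-Request-ID"] = request_id

if __name__ == "__main__":
    handle({})
```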
Multi-Tenancy and Access Control
OpenShift environments often serve multiple teams or tenants. Observability must respect this:
- Segregate data by namespace or label where appropriate.
- Restrict access so that:
- Users can only view logs/metrics for namespaces they have access to.
- Sensitive platform logs are hidden from developers where required.
- Provide shared, high-level dashboards for global status, but detailed views only where permitted.
This is particularly important in managed or shared clusters.
Capacity, Retention, and Performance
Observability systems themselves are resource-intensive. When planning monitoring and logging on OpenShift:
- Define:
- Retention periods for different data types (shorter for high-volume logs, longer for high-value metrics).
- Sampling strategies for traces and high-cardinality metrics.
- Watch out for:
- Excessive metric label cardinality (e.g., per-request IDs as labels).
- Huge log volumes from overly verbose logging (e.g., debug level in production).
- Right-size:
- Storage backends (e.g., object storage, distributed search indices).
- Alert rule complexity and frequency.
The goal is to balance observability depth with operational cost and performance.
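As a small illustration of the cardinality point, the sketch below contrasts a problematic per-request label with bounded alternatives such as a route template and a status class; metric and label names are illustrative.

```python
# Minimal sketch: keep metric labels low-cardinality.
from prometheus_client import Counter

# Problematic: every request ID or raw URL creates a new time series,
# which bloats memory and storage in the metrics backend.
# BAD = Counter("http_requests_total", "requests", ["request_id", "raw_url"])

# Better: bounded label values such as the route template and status class.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["route", "status_class"],
)

def record(route_template: str, status_code: int) -> None:
    # "2xx", "4xx", "5xx" instead of every individual status code,
    # and "/orders/{id}" instead of "/orders/12345".
    REQUESTS.labels(route=route_template,
                    status_class=f"{status_code // 100}xx").inc()

record("/orders/{id}", 200)
record("/orders/{id}", 503)
```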
Alerting Principles
Alerts are typically built on top of metrics and sometimes logs. Good alerting practices in OpenShift environments include:
- Prefer symptom-based alerts over infrastructure-only alerts:
- Example: Alert on “high error rate for public route” rather than “pod restart count increased” alone.
- Use multi-level severity (warning vs critical) with clear runbooks:
- Each alert should indicate what it likely means and suggested first steps.
- Avoid alert fatigue:
- Deduplicate alerts across replicas and services.
- Aggregate related conditions into a single higher-level signal where possible.
Alerting strategy is usually coordinated between cluster admins and application teams.
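The decision logic behind a symptom-based, multi-severity alert can be summarized in a few lines. On OpenShift this logic would normally live in a PromQL alerting rule rather than in application code, so the sketch below is purely illustrative and the thresholds are invented.

```python
# Purely illustrative: the decision logic of a symptom-based alert with two
# severities. In practice this lives in a PromQL alerting rule, not app code.

def route_error_alert(error_count: float, total_count: float) -> str | None:
    """Return an alert severity based on the user-facing error ratio."""
    if total_count == 0:
        return None
    error_ratio = error_count / total_count
    # Example thresholds only; real values come from the service's SLO.
    if error_ratio >= 0.05:
        return "critical"   # page someone; runbook: check route and backends
    if error_ratio >= 0.01:
        return "warning"    # investigate during working hours
    return None

print(route_error_alert(error_count=30, total_count=1000))   # -> "warning"
print(route_error_alert(error_count=80, total_count=1000))   # -> "critical"
```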
Integrating External Observability Systems
Many organizations already have centralized observability platforms, and OpenShift is designed to integrate with such systems.
Common integration patterns:
- Forwarding logs to external log aggregation systems or SIEM platforms.
- Exporting metrics or scraping metrics from OpenShift into central observability stacks.
- Connecting tracing (e.g., OpenTelemetry) from OpenShift services to organization-wide tracing backends.
Key considerations when integrating:
- Authentication and authorization for cross-cluster or external access.
- Network connectivity and egress control.
- Data normalization (consistent naming, labels, and formats).
- Handling multi-cluster environments (unique cluster identifiers in labels/metadata).
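As one example of the tracing integration pattern, the sketch below configures an OpenTelemetry OTLP exporter so that spans from a workload are sent to an organization-wide collector. The endpoint, token, service name, and cluster name are placeholders; authentication and egress details depend on the environment.

```python
# Minimal sketch: export spans from a workload to an external OTLP collector.
# Requires opentelemetry-sdk and opentelemetry-exporter-otlp.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Placeholder endpoint and token; real values come from the central
# observability team and must be reachable through allowed egress.
exporter = OTLPSpanExporter(
    endpoint="otel-collector.observability.example.com:4317",
    headers=(("authorization", "Bearer <token>"),),
)

provider = TracerProvider(
    # A cluster identifier helps disambiguate data in multi-cluster setups.
    resource=Resource.create({"service.name": "checkout-service",
                              "k8s.cluster.name": "prod-eu-1"})
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("startup-check"):
    pass  # spans are batched and sent to the external backend
```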
Observability for Operators and Platform Services
OpenShift makes heavy use of Operators and custom resources. Observability here has some specifics:
- Operators often expose:
- Health metrics for reconciliations and errors.
- Custom resource metrics (counts, status summaries).
- Failures in Operators may manifest as:
- Stalled cluster upgrades.
- Stuck resources in non-ready states.
- Repeated reconciliation errors in logs.
For platform teams, having clear visibility into Operator health is critical to maintaining cluster stability and performing upgrades safely.
Observability in Dynamic and Ephemeral Environments
Many OpenShift workloads are short-lived or highly dynamic:
- Pods are ephemeral:
- Logs vanish when pods are deleted unless centrally collected.
- Metrics must be scraped frequently enough to catch short-lived pods.
- Autoscaling and rolling updates:
- New pods with new IPs and names appear constantly.
- Dashboards and queries should use labels (e.g., app=frontend) instead of specific pod names.
Observability approaches on OpenShift must be built around these dynamics, relying on labels and centralized collection rather than static hostnames or long-lived processes.
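The same principle applies when querying the API directly: select workloads by label rather than by pod name. The sketch below uses the Kubernetes Python client; the namespace and label values are examples.

```python
# Minimal sketch: find workloads by label instead of by ephemeral pod name.
# Requires the kubernetes Python client and access to a cluster/kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Pods come and go, but the label selector keeps matching the current set.
pods = v1.list_namespaced_pod("frontend-prod", label_selector="app=frontend")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```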
Observability and Reliability Practices
Finally, observability on OpenShift is closely linked to reliability engineering:
- Service Level Objectives (SLOs) and SLIs
- Define what “good” looks like (e.g., 99.9% of requests under 300 ms, 99.95% availability).
- Implement metrics and alerts that reflect these goals.
- Error budgets
- Use measured performance to determine how much risk (e.g., deployments, experiments) you can take; a small numeric sketch follows this list.
- Continuous improvement
- After incidents, use observability data for post-incident reviews.
- Identify which signals or dashboards were missing or confusing and refine them.
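To make the error budget idea concrete, here is the small numeric sketch mentioned above; the SLO target and request counts are invented.

```python
# Minimal numeric sketch of an error budget for a request-based SLO.
slo_target = 0.999            # 99.9% of requests should succeed
total_requests = 10_000_000   # requests in the SLO window (e.g., 30 days)
failed_requests = 6_500       # observed failures in the same window

error_budget = (1 - slo_target) * total_requests   # 10,000 allowed failures
remaining = error_budget - failed_requests         # 3,500 failures still "affordable"
consumed_pct = failed_requests / error_budget * 100

print(f"budget: {error_budget:.0f}, remaining: {remaining:.0f}, "
      f"consumed: {consumed_pct:.1f}%")
# With most of the budget consumed, risky changes (big deployments,
# experiments) would typically be slowed down until reliability recovers.
```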
OpenShift provides the primitives and integrations; how they are used is key to turning raw data into actionable insight and improved reliability.
In the following subsections of this course, you will see how these concepts are realized concretely in OpenShift’s built-in monitoring stack, metrics and alerts, logging architecture, distributed tracing, and practical troubleshooting workflows.