Why Monitoring, Logging, and Observability Matter on OpenShift
Modern applications on OpenShift are distributed, dynamic, and often composed of many microservices. Pods are created and destroyed frequently, workloads scale automatically, and failures can be transient or hidden. In this environment, traditional “check a single server” style monitoring is not enough.
Monitoring, logging, and observability together enable you to:
- Understand the current health of the platform and applications
- Detect and react to failures and performance regressions
- Analyze trends and capacity usage over time
- Troubleshoot complex, cross-service issues
- Meet SLOs/SLAs and compliance requirements
On OpenShift, these capabilities are provided by an integrated stack that builds on Kubernetes concepts but adds opinionated defaults, security controls, and multi-tenancy features.
Core Concepts and Terminology
Before looking at the specific OpenShift components, it helps to differentiate:
- Monitoring: Continuous collection and visualization of metrics (numbers over time) to track health, performance, and resource usage. Typically used for dashboards and alerting.
- Logging: Collection, storage, and querying of event logs (structured or unstructured text) generated by applications and platform components.
- Tracing: Recording and visualizing the path of individual requests across multiple services, with timing information.
- Observability: The overall ability to understand internal states of the system from externally visible outputs (metrics, logs, traces, events).
OpenShift provides built-in monitoring and integrates with logging and tracing stacks to achieve practical observability for both platform and user workloads.
Layers of Observability in OpenShift
Observability on OpenShift can be thought of in three main layers:
- Platform (cluster) layer
- Metrics: API server, controllers, etcd, kubelet, SDN, ingress, storage, Operators, nodes.
- Logs: Control plane logs, infrastructure workloads, ingress, Operators.
- Focus: Cluster health, capacity, upgrade readiness, SLA of the platform itself.
- Application (user workload) layer
- Metrics: Application performance (e.g., request rate, error rate, latency), business metrics, custom metrics.
- Logs: Application stdout/stderr, structured logs, audit trails.
- Focus: Application correctness, performance, debugging logical errors.
- Request (end-to-end) layer
- Traces: End-to-end view of a single transaction through multiple services.
- Focus: Latency hotspots, dependency analysis, bottlenecks in distributed calls.
Different OpenShift components target these different layers; cluster administrators and application developers typically interact with them at different levels of detail and with different permissions.
Observability Responsibilities and Roles
On OpenShift, responsibilities are usually split between:
- Cluster administrators
- Ensure the platform monitoring and logging are deployed and healthy.
- Configure retention, scaling, and backends for logs and metrics.
- Provide access patterns (dashboards, APIs) and guardrails.
- Monitor infrastructure components and Operators.
- Application teams / developers
- Expose useful application-level metrics.
- Emit structured, meaningful logs to stdout/stderr.
- Add tracing instrumentation where relevant.
- Use the provided tools to monitor and troubleshoot their workloads.
This separation shapes how OpenShift structures its “platform” vs “user” monitoring and logging.
Data Sources: What Gets Observed
Across monitoring, logging, and tracing, the main data sources in OpenShift include:
- Kubernetes and OpenShift components
- API server, controller manager, scheduler
- etcd
- Kubelet and node-level exporters
- SDN / network plugins and ingress controllers
- Storage drivers, CSI sidecars
- OpenShift-specific controllers (e.g., cluster version operator, machine API)
- Workload resources
- Pods, Deployments, StatefulSets, Jobs, CronJobs
- Services, Routes, Ingress
- PersistentVolumes, PersistentVolumeClaims
- Custom resources managed by Operators
- Applications themselves
- Business and domain metrics
- Application logs
- Traces and spans for distributed calls
In practice, each of these either exposes metrics endpoints, writes logs to stdout/stderr, or offers trace hooks via an SDK or sidecar.
Observability Data Types
Metrics
Metrics are time-series data: numeric values that change over time.
Common categories:
- Resource metrics
- CPU usage, memory usage, filesystem usage
- Request vs limit utilization
- Node resource utilization
- Platform metrics
- API server request latency and error counts
- etcd operation latency and failures
- Ingress request counts, 4xx/5xx rates
- Controller queue lengths, reconciliation errors
- Application metrics
- Requests per second (RPS/QPS), success/error rates
- Latency histograms or quantiles
- Queue lengths, worker pool utilization
- Business KPIs (e.g., orders per minute)
Metrics are typically pulled from HTTP endpoints (e.g., /metrics in Prometheus format) or collected by node-level agents.
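As an illustration of that pull model, the sketch below uses the Python prometheus_client library to expose a request counter and a latency histogram on a /metrics endpoint. The metric names, label names, and port are illustrative choices, not anything OpenShift mandates.

```python
# Minimal sketch: expose Prometheus-format metrics over HTTP.
# Requires the prometheus_client package (pip install prometheus-client).
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric and label names; keep label values low-cardinality.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["method", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
)

def handle_request() -> None:
    """Simulate handling one request and record metrics for it."""
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.1))   # pretend work
    status = "200" if random.random() > 0.05 else "500"
    REQUESTS.labels(method="GET", status=status).inc()

if __name__ == "__main__":
    # Serves text-format metrics at http://<pod-ip>:8080/metrics,
    # which a Prometheus-compatible scraper can then pull.
    start_http_server(8080)
    while True:
        handle_request()
```

On OpenShift, such an endpoint is typically scraped by the monitoring stack once a scrape target (for example, a ServiceMonitor for user workloads) is configured; those details are covered later in the course.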
Logs
Logs are event records, usually text-based, sometimes structured as JSON.
Key log streams in OpenShift environments:
- Application logs
- Written to stdout/stderr from containers
- Often structured with fields such as timestamp, level, request ID
- Infrastructure and platform logs
- Control plane components
- Ingress controllers and proxies
- Storage drivers and Operators
- Node system logs (journal, kernel messages)
- Security and audit logs
- Kubernetes API audit logs
- Authentication/authorization logs
- Compliance and policy enforcement logs (depending on configuration)
Logs are essential when metrics indicate “something is wrong” but you need context to understand exactly what happened.
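To make the structured-log idea above concrete, here is a minimal Python sketch that writes JSON log lines with timestamp, level, and request ID fields to stdout, where a cluster-level collector can pick them up. The field names are an illustrative convention, not a required schema.

```python
# Minimal sketch: structured JSON logging to stdout so a cluster-level
# log collector can pick up the lines and index the fields.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Illustrative correlation field, supplied per request via `extra`.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Extra fields become attributes on the log record and end up in the JSON line.
logger.info("order accepted", extra={"request_id": "req-12345"})
```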
Traces
Traces represent individual requests as they traverse multiple services.
Fundamental elements:
- Trace: One end-to-end request (e.g., a web request).
- Span: A timed operation within that request (e.g., “call database”, “call payment service”).
- Context propagation: Passing trace IDs between services so spans can be linked.
In OpenShift environments with microservices and service meshes, tracing helps identify which service is responsible for high latency or errors in complex call chains.
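A minimal Python sketch of these elements using the OpenTelemetry SDK is shown below: two nested spans form a small trace, and inject() copies the trace context into outgoing headers so a downstream service can continue the same trace. It exports to the console purely for illustration; a real deployment would export to a tracing backend, and the service name is a placeholder.

```python
# Minimal sketch: create spans and propagate trace context with OpenTelemetry.
# Requires opentelemetry-api and opentelemetry-sdk.
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; a real service would export to a collector.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_checkout() -> None:
    # One span per logical operation; nested spans form the trace tree.
    with tracer.start_as_current_span("handle-checkout"):
        with tracer.start_as_current_span("call-payment-service"):
            headers: dict[str, str] = {}
            # Injects W3C traceparent headers so the downstream service
            # can link its spans to this trace.
            inject(headers)
            print("outgoing headers:", headers)

if __name__ == "__main__":
    handle_checkout()
```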
Observability Patterns and Best Practices on OpenShift
Unified but Segregated: Platform vs User Observability
OpenShift emphasizes:
- Platform monitoring/logging
- Usually managed and operated by the cluster admin.
- Often restricted in access for security and multi-tenancy.
- Focused on cluster-wide and infrastructure metrics/logs.
- User/workload monitoring/logging
- Accessible by project/namespace owners.
- Tailored to application needs.
- Often uses the same tools/stack, but with separate data flows, retention, and permissions.
This pattern allows strong separation of concerns while giving app teams enough observability without exposing sensitive platform internals.
Labels and Metadata
Labels and annotations are crucial for making metrics, logs, and traces useful:
- Labels on Kubernetes/OpenShift resources (e.g., app, component, version, environment) help:
- Filter metrics by application, environment, or component
- Group dashboard panels
- Correlate logs with deployments or versions
- Consistent label schemes across:
- Deployments/Pods
- Services and Routes
- Metrics (e.g., Prometheus labels)
- Logs (e.g., structured fields)
Using the same scheme across all of these makes it much easier to correlate the different data types (a short sketch of sharing one label set in code follows below).
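One way to keep such a scheme consistent inside application code is to define the label set once and reuse it for both metric labels and structured log fields, as in the sketch below. The environment variable names and label values are assumptions for illustration; on OpenShift the values would typically mirror the labels set on the Deployment.

```python
# Minimal sketch: one shared label set reused for metrics and log fields,
# mirroring the labels on the Deployment/Pod (app, component, version, environment).
import json
import os

from prometheus_client import Counter

# Typically injected via environment variables or the Downward API;
# the variable names here are an assumption for illustration.
LABELS = {
    "app": os.getenv("APP_NAME", "frontend"),
    "component": os.getenv("APP_COMPONENT", "web"),
    "version": os.getenv("APP_VERSION", "1.4.2"),
    "environment": os.getenv("APP_ENVIRONMENT", "production"),
}

ORDERS = Counter(
    "orders_processed_total",
    "Orders processed",
    list(LABELS.keys()),
)

def record_order() -> None:
    # The same identifiers appear in the metric ...
    ORDERS.labels(**LABELS).inc()
    # ... and in the structured log line, so both can be filtered the same way.
    print(json.dumps({"message": "order processed", **LABELS}))

if __name__ == "__main__":
    record_order()
```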
Correlation Across Metrics, Logs, and Traces
Effective observability in OpenShift hinges on correlating different types of signals:
- Use common identifiers
- Request ID or correlation ID added to:
- HTTP headers
- Log entries
- Trace context
- Pod or container IDs included in log records and metrics labels
- Use consistent naming
- Service names that match:
- Kubernetes Service and Deployment names
- Trace service names
- Dashboard panels
This allows workflows like:
- Start from an alert (metrics) → identify the affected pods and services → jump to logs → if needed, inspect traces for specific requests (a sketch of the correlation-ID piece follows below).
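A minimal sketch of the correlation-ID part of that workflow: the same request ID, taken from an incoming header or generated, is attached to the current span and to every log line, so a trace and its logs can be matched up later. The header name X-Request-ID is a common convention rather than an OpenShift requirement, and the sketch assumes a tracer is already configured as in the earlier tracing example.

```python
# Minimal sketch: carry one correlation ID through logs and trace attributes.
# Assumes a TracerProvider is already configured (see the tracing sketch above).
import json
import uuid

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle(headers: dict[str, str]) -> None:
    # Reuse the caller's ID if present, otherwise create one.
    request_id = headers.get("X-Request-ID", str(uuid.uuid4()))

    with tracer.start_as_current_span("handle-request") as span:
        # The same ID goes onto the span ...
        span.set_attribute("request.id", request_id)
        # ... and into every structured log line for this request.
        print(json.dumps({"level": "INFO",
                          "message": "request handled",
                          "request_id": request_id}))
        # Forward the ID to downstream calls so their logs line up too.
        headers["X-Request-ID"] = request_id

if __name__ == "__main__":
    handle({})
```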
Multi-Tenancy and Access Control
OpenShift environments often serve multiple teams or tenants. Observability must respect this:
- Segregate data by namespace or label where appropriate.
- Restrict access so that:
- Users can only view logs/metrics for namespaces they have access to.
- Sensitive platform logs are hidden from developers where required.
- Provide shared, high-level dashboards for global status, but detailed views only where permitted.
This is particularly important in managed or shared clusters.
Capacity, Retention, and Performance
Observability systems themselves are resource-intensive. When planning monitoring and logging on OpenShift:
- Define:
- Retention periods for different data types (shorter for high-volume logs, longer for high-value metrics).
- Sampling strategies for traces and high-cardinality metrics.
- Watch out for:
- Excessive metric label cardinality (e.g., per-request IDs as labels).
- Huge log volumes from overly verbose logging (e.g., debug level in production).
- Right-size:
- Storage backends (e.g., object storage, distributed search indices).
- Alert rule complexity and frequency.
The goal is to balance observability depth with operational cost and performance.
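As a small illustration of the cardinality point, the sketch below contrasts a problematic per-request label with bounded alternatives such as a route template and a status class; metric and label names are illustrative.

```python
# Minimal sketch: keep metric labels low-cardinality.
from prometheus_client import Counter

# Problematic: every request ID or raw URL creates a new time series,
# which bloats memory and storage in the metrics backend.
# BAD = Counter("http_requests_total", "requests", ["request_id", "raw_url"])

# Better: bounded label values such as the route template and status class.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["route", "status_class"],
)

def record(route_template: str, status_code: int) -> None:
    # "2xx", "4xx", "5xx" instead of every individual status code,
    # and "/orders/{id}" instead of "/orders/12345".
    REQUESTS.labels(route=route_template,
                    status_class=f"{status_code // 100}xx").inc()

record("/orders/{id}", 200)
record("/orders/{id}", 503)
```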
Alerting Principles
Alerts are typically built on top of metrics and sometimes logs. Good alerting practices in OpenShift environments include:
- Prefer symptom-based alerts over infrastructure-only alerts:
- Example: Alert on “high error rate for public route” rather than “pod restart count increased” alone.
- Use multi-level severity (warning vs critical) with clear runbooks:
- Each alert should indicate what it likely means and suggested first steps.
- Avoid alert fatigue:
- Deduplicate alerts across replicas and services.
- Aggregate related conditions into a single higher-level signal where possible.
Alerting strategy is usually coordinated between cluster admins and application teams.
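The decision logic behind a symptom-based, multi-severity alert can be summarized in a few lines. On OpenShift this logic would normally live in a PromQL alerting rule rather than in application code, so the sketch below is purely illustrative and the thresholds are invented.

```python
# Purely illustrative: the decision logic of a symptom-based alert with two
# severities. In practice this lives in a PromQL alerting rule, not app code.

def route_error_alert(error_count: float, total_count: float) -> str | None:
    """Return an alert severity based on the user-facing error ratio."""
    if total_count == 0:
        return None
    error_ratio = error_count / total_count
    # Example thresholds only; real values come from the service's SLO.
    if error_ratio >= 0.05:
        return "critical"   # page someone; runbook: check route and backends
    if error_ratio >= 0.01:
        return "warning"    # investigate during working hours
    return None

print(route_error_alert(error_count=30, total_count=1000))   # -> "warning"
print(route_error_alert(error_count=80, total_count=1000))   # -> "critical"
```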
Integrating External Observability Systems
Many organizations already have centralized observability platforms, and OpenShift is designed to integrate with such systems.
Common integration patterns:
- Forwarding logs to external log aggregation systems or SIEM platforms.
- Exporting metrics or scraping metrics from OpenShift into central observability stacks.
- Connecting tracing (e.g., OpenTelemetry) from OpenShift services to organization-wide tracing backends.
Key considerations when integrating:
- Authentication and authorization for cross-cluster or external access.
- Network connectivity and egress control.
- Data normalization (consistent naming, labels, and formats).
- Handling multi-cluster environments (unique cluster identifiers in labels/metadata).
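As one example of the tracing integration pattern, the sketch below configures an OpenTelemetry OTLP exporter so that spans from a workload are sent to an organization-wide collector. The endpoint, token, service name, and cluster name are placeholders; authentication and egress details depend on the environment.

```python
# Minimal sketch: export spans from a workload to an external OTLP collector.
# Requires opentelemetry-sdk and opentelemetry-exporter-otlp.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Placeholder endpoint and token; real values come from the central
# observability team and must be reachable through allowed egress.
exporter = OTLPSpanExporter(
    endpoint="otel-collector.observability.example.com:4317",
    headers=(("authorization", "Bearer <token>"),),
)

provider = TracerProvider(
    # A cluster identifier helps disambiguate data in multi-cluster setups.
    resource=Resource.create({"service.name": "checkout-service",
                              "k8s.cluster.name": "prod-eu-1"})
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("startup-check"):
    pass  # spans are batched and sent to the external backend
```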
Observability for Operators and Platform Services
OpenShift makes heavy use of Operators and custom resources. Observability here has some specifics:
- Operators often expose:
- Health metrics for reconciliations and errors.
- Custom resource metrics (counts, status summaries).
- Failures in Operators may manifest as:
- Stalled cluster upgrades.
- Stuck resources in non-ready states.
- Repeated reconciliation errors in logs.
For platform teams, having clear visibility into Operator health is critical to maintaining cluster stability and performing upgrades safely.
Observability in Dynamic and Ephemeral Environments
Many OpenShift workloads are short-lived or highly dynamic:
- Pods are ephemeral:
- Logs vanish when pods are deleted unless centrally collected.
- Metrics must be scraped frequently enough to catch short-lived pods.
- Autoscaling and rolling updates:
- New pods with new IPs and names appear constantly.
- Dashboards and queries should use labels (e.g., app=frontend) instead of specific pod names.
Observability approaches on OpenShift must be built around these dynamics, relying on labels and centralized collection rather than static hostnames or long-lived processes.
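The same principle applies when querying the API directly: select workloads by label rather than by pod name. The sketch below uses the Kubernetes Python client; the namespace and label values are examples.

```python
# Minimal sketch: find workloads by label instead of by ephemeral pod name.
# Requires the kubernetes Python client and access to a cluster/kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Pods come and go, but the label selector keeps matching the current set.
pods = v1.list_namespaced_pod("frontend-prod", label_selector="app=frontend")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```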
Observability and Reliability Practices
Finally, observability on OpenShift is closely linked to reliability engineering:
- Service Level Objectives (SLOs) and SLIs
- Define what “good” looks like (e.g., 99.9% of requests under 300 ms, 99.95% availability).
- Implement metrics and alerts that reflect these goals.
- Error budgets
- Use measured performance to determine how much risk (e.g., deployments, experiments) you can take; a small numeric sketch follows this list.
- Continuous improvement
- After incidents, use observability data for post-incident reviews.
- Identify which signals or dashboards were missing or confusing and refine them.
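To make the error budget idea concrete, here is the small numeric sketch mentioned above; the SLO target and request counts are invented.

```python
# Minimal numeric sketch of an error budget for a request-based SLO.
slo_target = 0.999            # 99.9% of requests should succeed
total_requests = 10_000_000   # requests in the SLO window (e.g., 30 days)
failed_requests = 6_500       # observed failures in the same window

error_budget = (1 - slo_target) * total_requests   # 10,000 allowed failures
remaining = error_budget - failed_requests         # 3,500 failures still "affordable"
consumed_pct = failed_requests / error_budget * 100

print(f"budget: {error_budget:.0f}, remaining: {remaining:.0f}, "
      f"consumed: {consumed_pct:.1f}%")
# With most of the budget consumed, risky changes (big deployments,
# experiments) would typically be slowed down until reliability recovers.
```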
OpenShift provides the primitives and integrations; how they are used is key to turning raw data into actionable insight and improved reliability.
In the following subsections of this course, you will see how these concepts are realized concretely in OpenShift’s built-in monitoring stack, metrics and alerts, logging architecture, distributed tracing, and practical troubleshooting workflows.