Types of Metrics in OpenShift
In OpenShift, most “monitoring” data you work with falls into a small set of metric types. Understanding them helps you create meaningful alerts instead of noisy ones.
Core metric types
- Counter
- Monotonically increasing value (only goes up, or resets).
- Examples: `container_cpu_usage_seconds_total`, `http_requests_total`
- Typical use: rates per second or per minute using PromQL `rate()` or `irate()`.
- Gauge
- Value that can go up and down.
- Examples: `node_memory_MemAvailable_bytes`, `kube_pod_container_status_restarts_total` (often behaves like a counter but is modeled as a gauge), `cluster:usage:cpu_cores:sum`
- Typical use: snapshots, thresholds (e.g., alert if > X).
- Histogram
- Measures distributions of observations (size, latency, duration).
- Implemented as multiple series with buckets, e.g. `*_bucket`, `*_sum`, `*_count`.
- Examples: `apiserver_request_duration_seconds_bucket`
- Typical use: SLIs like response time percentiles (p90, p95, p99).
- Summary
- Similar to histogram but calculates quantiles on the client side.
- Less commonly used for cluster-level alerting in OpenShift; histograms are preferred.
High-level categories of metrics
- Platform / cluster metrics
- Collected by the cluster monitoring stack, focused on:
- Nodes (CPU, memory, disk, network)
- Kubernetes components (API server, controller manager, etcd)
- OpenShift components (SDN, OAuth server, Router, etc.)
- Primary for platform SRE / cluster admins.
- Application metrics
- Emitted by workloads you deploy in namespaces.
- Typical patterns:
- Business metrics (orders processed, jobs completed).
- Technical metrics (request latency, error rate, queue depth).
- Scraped by user workload monitoring (if configured) or external Prometheus.
Metrics Collection and Exposure
Prometheus as the core metrics engine
OpenShift’s metrics and alerting are centered on Prometheus:
- Cluster Monitoring stack (managed by the Cluster Monitoring Operator):
- Operates in `openshift-monitoring`.
- Contains:
- Prometheus instances (platform metrics)
- Alertmanager
- Thanos components (in some configurations)
- Grafana (depending on version and configuration)
- User Workload Monitoring:
- Optional.
- For scraping and alerting on application metrics in user namespaces, managed under `openshift-user-workload-monitoring` when enabled (see the configuration sketch below).
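Enabling user workload monitoring is usually a one-line change in the cluster monitoring ConfigMap. A minimal sketch, assuming the commonly documented ConfigMap name and field (verify against your OpenShift version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Deploys the user workload monitoring stack into
    # openshift-user-workload-monitoring.
    enableUserWorkload: true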
How metrics get into Prometheus
Prometheus relies on a pull model (scraping):
- Targets expose metrics on an HTTP endpoint, typically `/metrics`, in Prometheus text format.
- Prometheus scrapes these endpoints on a fixed interval (e.g., every 30s).
In OpenShift, you generally don't hand-edit `prometheus.yml`. Instead, you use:
- ServiceMonitor
- Custom resource (CRD) that describes how to scrape metrics from a `Service`.
- Example fields: `selector` (which Services to match) and `endpoints` (path, port, scheme, interval).
- Belongs to user workload monitoring; cluster admins manage the cluster-level equivalents.
- PodMonitor
- Similar to `ServiceMonitor` but targets Pods directly.
- Useful when you don't expose a Service or want pod-level selection (a sketch follows after this list).
- Scrape configurations for platform components
- Managed automatically by the Cluster Monitoring Operator for core OpenShift components (you do not modify these directly).
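For the PodMonitor case mentioned above, a minimal sketch (the name, namespace, and labels are hypothetical; the ServiceMonitor equivalent is shown in the next example):

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app-pod-monitor
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app: my-app
  podMetricsEndpoints:
  - port: metrics      # must match a named containerPort on the Pod
    interval: 30s
    path: /metrics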
Example: ServiceMonitor (conceptual)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
    - my-namespace
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
Working with Metrics via the OpenShift Console
The OpenShift web console provides multiple entry points for metrics:
- Observe → Metrics:
- PromQL query interface.
- You can:
- Run queries against the cluster Prometheus.
- View graphs over time.
- Inspect series labels.
- Workloads and Nodes detail pages (dashboards):
- Surface key metrics (CPU, memory, network, restarts) for:
- Pods
- Deployments
- Nodes
- These views are built on top of Prometheus queries.
Metrics visible here are primarily platform-level metrics; application metrics appear when user workload monitoring and the relevant ServiceMonitor/PodMonitor are configured.
Alerting Concepts in OpenShift
Alerts in OpenShift are defined in Prometheus alerting rules and handled by Alertmanager. The Cluster Monitoring Operator manages both for the platform stack.
Key components
- Alerting rules
- Expressions written in PromQL that evaluate metrics and produce alert states.
- Each rule defines:
- A name
- A PromQL expression
- A `for` duration (optional)
- Labels (severity, scope, etc.)
- Annotations (human-readable message, runbook link)
- Alertmanager
- Receives alerts from Prometheus.
- Responsible for:
- Grouping alerts
- Silencing (temporarily disabling notifications)
- Routing to receivers (email, webhooks, etc.; a configuration sketch follows after this list)
- Deduplication and inhibition
- Alerting integration in the console
- Observe → Alerts shows:
- Active, pending, and silenced alerts
- Alert details, labels, history
- You can:
- Filter by severity, namespace, or component
- Create silences for specific alert label sets
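As a sketch of what grouping and routing look like, here is a fragment of an Alertmanager configuration. In OpenShift the platform Alertmanager configuration is typically held in the alertmanager-main secret in openshift-monitoring or edited through the console, depending on version; the receiver names and webhook URL below are hypothetical:

route:
  group_by: ['namespace', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
  - receiver: pager
    matchers:
    - severity = "critical"
receivers:
- name: default
- name: pager
  webhook_configs:
  - url: https://alerts.example.com/hook   # hypothetical endpoint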
PromQL for Alerting
PromQL (Prometheus Query Language) is used both for dashboards and alerts. For alerts, you usually:
- Transform raw metrics into a useful signal, often using:
- `rate()` / `irate()` for counters
- `sum`, `avg`, `max`, `min`, `count` for aggregation
- `histogram_quantile()` for latency metrics from histograms
- Apply a condition:
- Compare against thresholds: `metric > threshold`
- Check ratios: `error_rate / total_requests > 0.05`
- Use label filters to scope what you alert on: `metric{namespace="my-namespace", app="my-app"}`
Common helper examples:
- Rate of HTTP requests over 5 minutes: `rate(http_requests_total[5m])`
- 5xx error ratio: `rate(http_requests_total{code=~"5.."}[5m]) / rate(http_requests_total[5m])`
- p95 latency from a histogram over 5 minutes: `histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m])))`
Defining Alerting Rules
Cluster-level alerting rules for OpenShift components are managed by the platform. For extending alerting (especially for applications), OpenShift uses Kubernetes custom resources:
- PrometheusRule CRD:
- Declares groups of alerting and recording rules (a recording-rule sketch follows after this list).
- Managed by the Cluster Monitoring Operator (for platform) and, in some configurations, by user workload monitoring (for apps).
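Recording rules precompute frequently used expressions and store the result as a new series. A minimal sketch of a PrometheusRule carrying a recording rule (the rule name and recorded metric are hypothetical):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-recording-rules
  namespace: my-namespace
spec:
  groups:
  - name: my-app-recording
    rules:
    # Precompute the per-second request rate so dashboards and alerts
    # can reuse it instead of re-evaluating rate() each time.
    - record: my_app:http_requests:rate5m
      expr: sum(rate(http_requests_total{app="my-app"}[5m]))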
Structure of an alert rule
An alert rule contains:
- `alert` – alert name (identifier)
- `expr` – PromQL expression
- `for` – required duration the condition must hold before firing
- `labels` – metadata for routing and classification (e.g. `severity`, `namespace`)
- `annotations` – human-friendly information (summary, description, runbook)
Conceptual example (application-oriented):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-rules
  namespace: my-namespace
spec:
  groups:
  - name: my-app-availability
    rules:
    - alert: MyAppHighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{app="my-app", status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="my-app"}[5m]))
        ) > 0.05
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High error rate for my-app"
        description: "More than 5% of my-app requests have been failing for 10 minutes."
This example illustrates typical alert design practices:
- Uses `rate()` for counters over a 5-minute window.
- Compares a ratio, not raw counts (helps on low traffic).
- Requires the condition to hold (`for: 10m`) to reduce flapping.
- Labels include `severity` for downstream routing.
Alert Lifecycle and States
Prometheus internally tracks alert states before passing them to Alertmanager:
- Inactive
- Condition is not met (`expr` evaluates to false or empty).
- Pending
- Condition is true, but the `for` duration has not yet elapsed.
- Firing
- Condition has been true for at least the `for` duration.
Alertmanager processes firing alerts:
- Groups them (e.g. by namespace and alertname).
- Applies routing rules and templates.
- Sends notifications to receivers.
- Allows silencing and inhibition rules (e.g. suppressing pod-level alerts when a node-level alert indicates a higher-level issue).
In the console under Observe → Alerts, you can see:
- Which alerts are firing or pending.
- Their labels and annotations.
- History of state changes.
Metrics, SLIs, SLOs, and Alerting
Metrics form the basis of Service Level Indicators (SLIs), which you use to implement Service Level Objectives (SLOs). Alerts should typically be SLO-driven, not just raw metric thresholds.
Typical SLIs in OpenShift environments:
- Availability / error rate
- Example SLI: ratio of successful to total requests.
- Metrics: `http_requests_total`, status code labels.
- Latency
- Example SLI: 95th percentile request duration.
- Metrics: histograms like `*_duration_seconds_bucket`.
- Resource saturation
- CPU, memory, disk I/O, and queue depths.
- Metrics: node and pod resource metrics, work queue metrics.
SLO-aligned alerting patterns:
- Fast-burn alerts
- Trigger if you will violate your SLO quickly (e.g., very high error rate).
- Slow-burn alerts
- Trigger if you are slowly consuming your error budget over a longer time.
These patterns are implemented using different PromQL windows and thresholds, all backed by the same base metrics.
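A minimal sketch of what a fast-burn / slow-burn pair can look like as a PrometheusRule. The windows, thresholds, and names below are illustrative assumptions, not tuned values; real burn-rate alerting is usually derived from your specific SLO target and error budget:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slo-burn
  namespace: my-namespace
spec:
  groups:
  - name: my-app-slo
    rules:
    # Fast burn: short window, high threshold. Catches sudden, severe breakage.
    - alert: MyAppErrorBudgetFastBurn
      expr: |
        (
          sum(rate(http_requests_total{app="my-app", status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="my-app"}[5m]))
        ) > 0.10
      for: 5m
      labels:
        severity: critical
    # Slow burn: long window, lower threshold. Catches gradual budget consumption.
    - alert: MyAppErrorBudgetSlowBurn
      expr: |
        (
          sum(rate(http_requests_total{app="my-app", status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{app="my-app"}[6h]))
        ) > 0.02
      for: 1h
      labels:
        severity: warning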
Best Practices for Metrics and Alerts in OpenShift
Metric design
- Expose low-cardinality labels
- Avoid labels with unbounded values (e.g., user IDs, request IDs).
- High cardinality can overload Prometheus and increase storage cost.
- Use consistent naming and units
- Suffix metrics with units: `_seconds`, `_bytes`, `_total`.
- Make it obvious whether it's a counter or gauge.
- Instrument business-relevant metrics
- Don’t limit yourself to CPU/memory; track rates that matter to your users (jobs, transactions, messages processed).
Alert design
- Prefer symptom-based alerts over cause-based
- Example:
- Alert on “application error rate high” rather than “pod restarted once.”
- Use platform alerts for underlying health (nodes, etcd, API server) and application alerts for user-visible symptoms.
- Avoid noisy alerts
- Always use appropriate windows and `for` durations.
- Consider traffic volume: percentages or ratios instead of absolute counts.
- Use severities consistently
- E.g. `critical` for user-visible outages or data loss, `warning` for degraded service or approaching capacity, `info` for non-urgent signals.
- Include clear runbooks or hints (an example fragment follows after this list)
- Use `annotations.description` to:
- Explain likely impact.
- Suggest first troubleshooting steps.
- Link to internal docs if available.
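As a sketch of what such annotations can look like inside a PrometheusRule rule entry (the summary text and runbook URL are hypothetical):

      labels:
        severity: warning
      annotations:
        summary: "my-app error rate above 5%"
        description: >-
          Likely impact: a portion of user requests is failing.
          Start by checking recent deployments and pod logs for my-app.
        runbook_url: "https://wiki.example.com/runbooks/my-app-errors"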
Operational considerations
- Be aware of the scope:
- Platform alerts (managed by OpenShift) vs application alerts (your responsibility).
- Avoid duplicating platform-level alerts in your own rules.
- Observe resource usage of metrics and alerts:
- Too many extremely detailed metrics or high-scrape-frequency targets can strain the Prometheus instances.
- Tune (a configuration sketch follows after this list):
- Scrape intervals
- Retention periods
- Label cardinality
- Validate alerts before relying on them:
- Use the console Metrics explorer to:
- Run the alert expression as a query.
- Check that it behaves as expected during normal and failure scenarios.
- Dry-run new alerts by setting them to lower severity or only logging initially.
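For example, retention for the user workload monitoring Prometheus is commonly tuned through its ConfigMap. A minimal sketch, assuming the commonly documented ConfigMap name and fields (exact options vary by OpenShift version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      # Keep user workload metrics for 7 days.
      retention: 7d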
Putting It Together in an OpenShift Environment
In practice, using metrics and alerts on OpenShift typically looks like:
- Instrument your application with metrics (e.g. Prometheus client libraries).
- Expose `/metrics` and create a `Service` pointing to that port (a sketch follows after this list).
- Create a `ServiceMonitor` so user workload monitoring scrapes your app.
- Explore metrics using the console's Metrics tab, refine PromQL queries.
- Define `PrometheusRule` objects with well-designed alerting rules.
- Use the Alerts view to:
- Observe alert states.
- Tune thresholds and durations.
- Create silences during maintenance.
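For the `Service` mentioned in the workflow above, a minimal sketch (the name, namespace, labels, and port number are hypothetical; the earlier ServiceMonitor example references this port by its name, metrics):

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app        # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
  - name: metrics      # referenced by the ServiceMonitor endpoint "port: metrics"
    port: 8080
    targetPort: 8080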
This workflow builds on the OpenShift monitoring stack that is already collecting cluster metrics and shipping default alerts, allowing you to extend it with application-specific metrics and alerting tailored to your workloads.