Types of Metrics in OpenShift
In OpenShift, most “monitoring” data you work with falls into a small set of metric types. Understanding them helps you create meaningful alerts instead of noisy ones.
Core metric types
- Counter
- Monotonically increasing value (only goes up, or resets).
- Examples: `container_cpu_usage_seconds_total`, `http_requests_total`
- Typical use: rates per second or per minute using PromQL `rate()` or `irate()`.
- Gauge
- Value that can go up and down.
- Examples: `node_memory_MemAvailable_bytes`, `kube_pod_container_status_restarts_total` (often behaves like a counter but is modeled as a gauge), `cluster:usage:cpu_cores:sum`
- Typical use: snapshots, thresholds (e.g., alert if > X).
- Histogram
- Measures distributions of observations (size, latency, duration).
- Implemented as multiple series with buckets, e.g. `*_bucket`, `*_sum`, `*_count`.
- Examples: `apiserver_request_duration_seconds_bucket`
- Typical use: SLIs like response time percentiles (p90, p95, p99).
- Summary
- Similar to histogram but calculates quantiles on the client side.
- Less commonly used for cluster-level alerting in OpenShift; histograms are preferred.
High-level categories of metrics
- Platform / cluster metrics
- Collected by the cluster monitoring stack, focused on:
- Nodes (CPU, memory, disk, network)
- Kubernetes components (API server, controller manager, etcd)
- OpenShift components (SDN, OAuth server, Router, etc.)
- Primary for platform SRE / cluster admins.
- Application metrics
- Emitted by workloads you deploy in namespaces.
- Typical patterns:
- Business metrics (orders processed, jobs completed).
- Technical metrics (request latency, error rate, queue depth).
- Scraped by user workload monitoring (if configured) or external Prometheus.
Metrics Collection and Exposure
Prometheus as the core metrics engine
OpenShift’s metrics and alerting are centered on Prometheus:
- Cluster Monitoring stack (managed by the Cluster Monitoring Operator):
- Operates in `openshift-monitoring`.
- Contains:
- Prometheus instances (platform metrics)
- Alertmanager
- Thanos components (in some configurations)
- Grafana (depending on version and configuration)
- User Workload Monitoring:
- Optional.
- For scraping and alerting on application metrics in user namespaces, managed under `openshift-user-workload-monitoring` when enabled (see the configuration sketch below).
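Enabling user workload monitoring is usually a one-line change in the cluster monitoring ConfigMap. A minimal sketch, assuming the commonly documented ConfigMap name and field (verify against your OpenShift version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Deploys the user workload monitoring stack into
    # openshift-user-workload-monitoring.
    enableUserWorkload: true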
How metrics get into Prometheus
Prometheus relies on a pull model (scraping):
- Targets expose metrics on an HTTP endpoint, typically `/metrics`, in Prometheus text format.
- Prometheus scrapes these endpoints on a fixed interval (e.g., every 30s).
In OpenShift, you generally don't hand-edit `prometheus.yml`. Instead, you use:
- ServiceMonitor
- Custom resource (CRD) that describes how to scrape metrics from a `Service`.
- Example fields: `selector` (which Services to match) and `endpoints` (path, port, scheme, interval).
- Belongs to user workload monitoring; cluster admins manage the cluster-level equivalents.
- PodMonitor
- Similar to `ServiceMonitor` but targets Pods directly.
- Useful when you don't expose a Service or want pod-level selection (a sketch follows after this list).
- Scrape configurations for platform components
- Managed automatically by the Cluster Monitoring Operator for core OpenShift components (you do not modify these directly).
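For the PodMonitor case mentioned above, a minimal sketch (the name, namespace, and labels are hypothetical; the ServiceMonitor equivalent is shown in the next example):

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app-pod-monitor
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app: my-app
  podMetricsEndpoints:
  - port: metrics      # must match a named containerPort on the Pod
    interval: 30s
    path: /metrics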
Example: ServiceMonitor (conceptual)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
    - my-namespace
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
Working with Metrics via the OpenShift Console
The OpenShift web console provides multiple entry points for metrics:
- Observe → Metrics:
- PromQL query interface.
- You can:
- Run queries against the cluster Prometheus.
- View graphs over time.
- Inspect series labels.
- Workloads and Nodes detail pages (dashboards):
- Surface key metrics (CPU, memory, network, restarts) for:
- Pods
- Deployments
- Nodes
- These views are built on top of Prometheus queries.
Metrics visible here are primarily platform-level metrics; application metrics appear when user workload monitoring and the relevant ServiceMonitor/PodMonitor are configured.
Alerting Concepts in OpenShift
Alerts in OpenShift are defined in Prometheus alerting rules and handled by Alertmanager. The Cluster Monitoring Operator manages both for the platform stack.
Key components
- Alerting rules
- Expressions written in PromQL that evaluate metrics and produce alert states.
- Each rule defines:
- A name
- A PromQL expression
- A `for` duration (optional)
- Labels (severity, scope, etc.)
- Annotations (human-readable message, runbook link)
- Alertmanager
- Receives alerts from Prometheus.
- Responsible for:
- Grouping alerts
- Silencing (temporarily disabling notifications)
- Routing to receivers (email, webhooks, etc.; a configuration sketch follows after this list)
- Deduplication and inhibition
- Alerting integration in the console
- Observe → Alerts shows:
- Active, pending, and silenced alerts
- Alert details, labels, history
- You can:
- Filter by severity, namespace, or component
- Create silences for specific alert label sets
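As a sketch of what grouping and routing look like, here is a fragment of an Alertmanager configuration. In OpenShift the platform Alertmanager configuration is typically held in the alertmanager-main secret in openshift-monitoring or edited through the console, depending on version; the receiver names and webhook URL below are hypothetical:

route:
  group_by: ['namespace', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
  - receiver: pager
    matchers:
    - severity = "critical"
receivers:
- name: default
- name: pager
  webhook_configs:
  - url: https://alerts.example.com/hook   # hypothetical endpoint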
PromQL for Alerting
PromQL (Prometheus Query Language) is used both for dashboards and alerts. For alerts, you usually:
- Transform raw metrics into a useful signal, often using:
- `rate()` / `irate()` for counters
- `sum`, `avg`, `max`, `min`, `count` for aggregation
- `histogram_quantile()` for latency metrics from histograms
- Apply a condition:
- Compare against thresholds: `metric > threshold`
- Check ratios: `error_rate / total_requests > 0.05`
- Use label filters to scope what you alert on: `metric{namespace="my-namespace", app="my-app"}`
Common helper examples:
- Rate of HTTP requests over 5 minutes: `rate(http_requests_total[5m])`
- 5xx error ratio: `rate(http_requests_total{code=~"5.."}[5m]) / rate(http_requests_total[5m])`
- p95 latency from a histogram over 5 minutes: `histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m])))`
Defining Alerting Rules
Cluster-level alerting rules for OpenShift components are managed by the platform. For extending alerting (especially for applications), OpenShift uses Kubernetes custom resources:
- PrometheusRule CRD:
- Declares groups of alerting and recording rules (a recording-rule sketch follows after this list).
- Managed by the Cluster Monitoring Operator (for platform) and, in some configurations, by user workload monitoring (for apps).
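Recording rules precompute frequently used expressions and store the result as a new series. A minimal sketch of a PrometheusRule carrying a recording rule (the rule name and recorded metric are hypothetical):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-recording-rules
  namespace: my-namespace
spec:
  groups:
  - name: my-app-recording
    rules:
    # Precompute the per-second request rate so dashboards and alerts
    # can reuse it instead of re-evaluating rate() each time.
    - record: my_app:http_requests:rate5m
      expr: sum(rate(http_requests_total{app="my-app"}[5m]))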
Structure of an alert rule
An alert rule contains:
- `alert` – alert name (identifier)
- `expr` – PromQL expression
- `for` – required duration the condition must hold before firing
- `labels` – metadata for routing and classification (e.g. `severity`, `namespace`)
- `annotations` – human-friendly information (summary, description, runbook)
Conceptual example (application-oriented):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-rules
  namespace: my-namespace
spec:
  groups:
  - name: my-app-availability
    rules:
    - alert: MyAppHighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{app="my-app", status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="my-app"}[5m]))
        ) > 0.05
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High error rate for my-app"
        description: "More than 5% of my-app requests have been failing for 10 minutes."
This example illustrates typical alert design practices:
- Uses `rate()` for counters over a 5-minute window.
- Compares a ratio, not raw counts (helps on low traffic).
- Requires the condition to hold (`for: 10m`) to reduce flapping.
- Labels include `severity` for downstream routing.
Alert Lifecycle and States
Prometheus internally tracks alert states before passing them to Alertmanager:
- Inactive
- Condition is not met (`expr` evaluates to false or empty).
- Pending
- Condition is true, but the `for` duration has not yet elapsed.
- Firing
- Condition has been true for at least the `for` duration.
Alertmanager processes firing alerts:
- Groups them (e.g. by namespace and alertname).
- Applies routing rules and templates.
- Sends notifications to receivers.
- Allows silencing and inhibition rules (e.g. suppressing pod-level alerts when a node-level alert indicates a higher-level issue).
In the console under Observe → Alerts, you can see:
- Which alerts are firing or pending.
- Their labels and annotations.
- History of state changes.
Metrics, SLIs, SLOs, and Alerting
Metrics form the basis of Service Level Indicators (SLIs), which you use to implement Service Level Objectives (SLOs). Alerts should typically be SLO-driven, not just raw metric thresholds.
Typical SLIs in OpenShift environments:
- Availability / error rate
- Example SLI: ratio of successful to total requests.
- Metrics: `http_requests_total`, status code labels.
- Latency
- Example SLI: 95th percentile request duration.
- Metrics: histograms like `*_duration_seconds_bucket`.
- Resource saturation
- CPU, memory, disk I/O, and queue depths.
- Metrics: node and pod resource metrics, work queue metrics.
SLO-aligned alerting patterns:
- Fast-burn alerts
- Trigger if you will violate your SLO quickly (e.g., very high error rate).
- Slow-burn alerts
- Trigger if you are slowly consuming your error budget over a longer time.
These patterns are implemented using different PromQL windows and thresholds, all backed by the same base metrics.
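A minimal sketch of what a fast-burn / slow-burn pair can look like as a PrometheusRule. The windows, thresholds, and names below are illustrative assumptions, not tuned values; real burn-rate alerting is usually derived from your specific SLO target and error budget:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slo-burn
  namespace: my-namespace
spec:
  groups:
  - name: my-app-slo
    rules:
    # Fast burn: short window, high threshold. Catches sudden, severe breakage.
    - alert: MyAppErrorBudgetFastBurn
      expr: |
        (
          sum(rate(http_requests_total{app="my-app", status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="my-app"}[5m]))
        ) > 0.10
      for: 5m
      labels:
        severity: critical
    # Slow burn: long window, lower threshold. Catches gradual budget consumption.
    - alert: MyAppErrorBudgetSlowBurn
      expr: |
        (
          sum(rate(http_requests_total{app="my-app", status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{app="my-app"}[6h]))
        ) > 0.02
      for: 1h
      labels:
        severity: warning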
Best Practices for Metrics and Alerts in OpenShift
Metric design
- Expose low-cardinality labels
- Avoid labels with unbounded values (e.g., user IDs, request IDs).
- High cardinality can overload Prometheus and increase storage cost.
- Use consistent naming and units
- Suffix metrics with units: `_seconds`, `_bytes`, `_total`.
- Make it obvious whether it's a counter or gauge.
- Instrument business-relevant metrics
- Don’t limit yourself to CPU/memory; track rates that matter to your users (jobs, transactions, messages processed).
Alert design
- Prefer symptom-based alerts over cause-based
- Example:
- Alert on “application error rate high” rather than “pod restarted once.”
- Use platform alerts for underlying health (nodes, etcd, API server) and application alerts for user-visible symptoms.
- Avoid noisy alerts
- Always use appropriate windows and `for` durations.
- Consider traffic volume: percentages or ratios instead of absolute counts.
- Use severities consistently
- E.g. `critical` for user-visible outages or data loss, `warning` for degraded service or approaching capacity, `info` for non-urgent signals.
- Include clear runbooks or hints (an example fragment follows after this list)
- Use `annotations.description` to:
- Explain likely impact.
- Suggest first troubleshooting steps.
- Link to internal docs if available.
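As a sketch of what such annotations can look like inside a PrometheusRule rule entry (the summary text and runbook URL are hypothetical):

      labels:
        severity: warning
      annotations:
        summary: "my-app error rate above 5%"
        description: >-
          Likely impact: a portion of user requests is failing.
          Start by checking recent deployments and pod logs for my-app.
        runbook_url: "https://wiki.example.com/runbooks/my-app-errors"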
Operational considerations
- Be aware of the scope:
- Platform alerts (managed by OpenShift) vs application alerts (your responsibility).
- Avoid duplicating platform-level alerts in your own rules.
- Observe resource usage of metrics and alerts:
- Too many extremely detailed metrics or high-scrape-frequency targets can strain the Prometheus instances.
- Tune (a configuration sketch follows after this list):
- Scrape intervals
- Retention periods
- Label cardinality
- Validate alerts before relying on them:
- Use the console Metrics explorer to:
- Run the alert expression as a query.
- Check that it behaves as expected during normal and failure scenarios.
- Dry-run new alerts by setting them to lower severity or only logging initially.
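For example, retention for the user workload monitoring Prometheus is commonly tuned through its ConfigMap. A minimal sketch, assuming the commonly documented ConfigMap name and fields (exact options vary by OpenShift version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      # Keep user workload metrics for 7 days.
      retention: 7d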
Putting It Together in an OpenShift Environment
In practice, using metrics and alerts on OpenShift typically looks like:
- Instrument your application with metrics (e.g. Prometheus client libraries).
- Expose `/metrics` and create a `Service` pointing to that port (a sketch follows after this list).
- Create a `ServiceMonitor` so user workload monitoring scrapes your app.
- Explore metrics using the console's Metrics tab, refine PromQL queries.
- Define `PrometheusRule` objects with well-designed alerting rules.
- Use the Alerts view to:
- Observe alert states.
- Tune thresholds and durations.
- Create silences during maintenance.
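For the `Service` mentioned in the workflow above, a minimal sketch (the name, namespace, labels, and port number are hypothetical; the earlier ServiceMonitor example references this port by its name, metrics):

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app        # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
  - name: metrics      # referenced by the ServiceMonitor endpoint "port: metrics"
    port: 8080
    targetPort: 8080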
This workflow builds on the OpenShift monitoring stack that is already collecting cluster metrics and shipping default alerts, allowing you to extend it with application-specific metrics and alerting tailored to your workloads.