Concept and Goals of Horizontal Pod Autoscaling
Horizontal Pod Autoscaling (HPA) in OpenShift automatically adjusts the number of pod replicas for a workload based on observed metrics. Instead of manually scaling a Deployment/DeploymentConfig up or down, HPA continuously evaluates metrics such as CPU or custom application metrics and changes replica counts accordingly.
Key characteristics:
- Scales horizontally by changing pod replica count, not pod size or node size.
- Works with controllers that manage replicas (e.g., Deployment, DeploymentConfig, and in some cases StatefulSet).
- Reacts to metrics over time, not to single spikes.
- Enforces minimum and maximum replica limits for safety and cost control.
HPA is best suited for:
- Stateless or pseudo-stateless workloads.
- Applications where performance correlates reasonably with a simple metric (CPU, memory, request rate, queue length, etc.).
- Scenarios with predictable spikes (e.g., daytime traffic) or unpredictable but metrics-driven load.
How HPA Works in OpenShift
At a high level, HPA involves:
- Metrics collection
- A metrics stack (e.g., OpenShift’s built‑in metrics, cluster monitoring) gathers pod and/or custom metrics.
- HPA controller
- A control loop (in the Kubernetes control plane) periodically checks metrics against the scaling rules defined in the HPA object.
- Replica adjustment
- The controller calculates the desired replica count and updates the target controller (Deployment, DeploymentConfig, etc.).
- Workload reconciliation
- The target controller creates or removes pods to reach the desired number of replicas.
The HPA controller uses the observed metrics and a target value to compute the desired replicas. For resource metrics like CPU, a typical formula is:
$$
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetric}}{\text{targetMetric}} \right\rceil
$$
where the result is rounded up to the next whole number and then clamped between minReplicas and maxReplicas.
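For example, if 4 replicas are currently averaging 90% CPU utilization against a 70% target:
$$
\text{desiredReplicas} = \left\lceil 4 \times \frac{90}{70} \right\rceil = \lceil 5.14 \rceil = 6
$$
so the workload would be scaled to 6 replicas, provided maxReplicas allows it.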
Supported Metrics Types
OpenShift’s HPA uses the same metric types as upstream Kubernetes, but the actual availability of each depends on cluster configuration.
Common categories:
- Resource metrics
- Built-in metrics such as CPU and memory usage per pod.
- Typically available if the cluster metrics pipeline is enabled.
- Object metrics
- Metrics describing another Kubernetes object (e.g., requests per second on a Service), made available through a metrics adapter; a conceptual sketch appears at the end of this section.
- External metrics
- Metrics that are not tied to Kubernetes objects (e.g., messages in a cloud message queue, external monitoring system metrics).
- Custom application metrics
- Application‑level metrics (such as requests per second, latency, backlog) exposed by the app and made available through a metrics adapter.
On many OpenShift clusters, you will most commonly start with CPU (and sometimes memory) metrics, and only later integrate custom or external metrics as your observability and metrics stack matures.
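For reference, an object metric definition could conceptually look like the sketch below. The metric name, the Route being described, and the target value are all hypothetical, and a metrics adapter must actually expose the metric for this to work:

metrics:
- type: Object
  object:
    metric:
      name: requests_per_second        # hypothetical metric exposed by a metrics adapter
    describedObject:
      apiVersion: route.openshift.io/v1
      kind: Route
      name: web-frontend               # hypothetical Route receiving the traffic
    target:
      type: Value
      value: "2000"                    # scale so the Route as a whole stays near 2000 req/s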
Defining an HPA Object
The HPA is defined as a standard Kubernetes resource (HorizontalPodAutoscaler) that references a scalable target and one or more metrics.
Basic fields:
- scaleTargetRef – which controller to scale:
- Kind: Deployment, DeploymentConfig, etc.
- Name: the object name.
- minReplicas – lower bound on replica count.
- maxReplicas – upper bound on replica count.
- metrics – list of metric sources and targets.
Example: CPU-based HPA for a Deployment:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend-hpa
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

This configuration:
- Ensures at least 2 and at most 10 replicas.
- Tries to keep average CPU utilization around 70% across all pods.
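An HPA can also track more than one metric at once: it computes a desired replica count for each metric and applies the highest. As a sketch, the metrics list above could be extended with a memory target; the 80% figure is purely illustrative and assumes realistic memory requests are set on the pods:

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Resource
  resource:
    name: memory                 # requires memory requests on the pods
    target:
      type: Utilization
      averageUtilization: 80     # illustrative target, not a recommendation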
Creating and Managing HPAs with `oc`
You can define HPAs either by manifest or via oc commands.
Creating a basic HPA
CPU-based HPA using oc:
oc autoscale deployment/web-frontend \
  --cpu-percent=70 \
  --min=2 \
  --max=10
For DeploymentConfig:
oc autoscale dc/my-api \
  --cpu-percent=60 \
  --min=1 \
  --max=8

This command generates an HPA resource for the chosen object in the current project/namespace.
Inspecting and describing HPAs
List HPAs:
oc get hpa

Check details and current scaling decisions:
oc describe hpa web-frontend-hpa

You’ll see information like:
- Current replicas and desired replicas
- Metrics and current values
- Events that show recent scaling actions
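To follow scaling decisions as they happen, the standard watch flag can be added to oc get (shown here against the earlier example HPA):

oc get hpa web-frontend-hpa --watch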
Updating and deleting HPAs
Update the HPA manifest (oc edit hpa web-frontend-hpa) or apply a changed YAML:
oc apply -f web-frontend-hpa.yaml

Delete an HPA (stopping autoscaling but not removing the workload):

oc delete hpa web-frontend-hpa

Interaction with Resource Requests and Limits
HPA behavior is strongly influenced by resource requests and limits:
- CPU-based autoscaling uses metrics relative to CPU requests when using averageUtilization.
- If requests are too small, CPU utilization may always look very high, causing over-scaling.
- If requests are too large, utilization may remain low, preventing scaling up even under load.
- Memory-based autoscaling uses memory metrics; note that memory usage can be less elastic than CPU and may not be ideal as the sole scaling trigger.
Best practice:
- Ensure your pods have realistic resources.requests (especially for CPU).
- Keep HPA target utilization and resource requests aligned with actual workload behavior (based on measurements, not guesses).
Example pod spec excerpt that HPA will rely on:
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
If averageUtilization is 70, HPA aims for about 70% of the CPU requests, not the limits.
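With the excerpt above, a target of 70 therefore means HPA tries to keep average usage near 0.70 × 200m = 140m of CPU per pod. If, for example, 3 replicas were averaging 180m (90% of requests), the earlier formula would give:
$$
\left\lceil 3 \times \frac{90}{70} \right\rceil = 4 \text{ replicas}
$$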
Scaling Behavior: Stabilization and Cooldown
HPA is not instantaneous; it has built‑in protections to avoid oscillation:
- Control loop period
- The HPA controller evaluates metrics periodically (every 15 seconds by default), not continuously.
- Scale up vs scale down behavior
- Scale up is usually faster to respond to increasing load.
- Scale down is often more conservative, using longer stabilization windows to avoid flapping.
- Stabilization windows
- HPA tracks recent recommendations; for downscaling, it may wait a configured period and use the highest recommendation in that window.
In autoscaling/v2, you can configure behavior more explicitly (if supported/enabled in your OpenShift version):
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60

This example:
- Limits scale-up to at most doubling the replicas per 60-second period (with a 60-second stabilization window).
- Limits scale-down to removing at most half of the replicas per 60-second period, with a 5-minute stabilization window.
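Policies can also be expressed as absolute pod counts instead of percentages. A small sketch (the value of 2 is illustrative) that caps scale-down at two pods per minute:

scaleDown:
  stabilizationWindowSeconds: 300
  policies:
  - type: Pods
    value: 2              # remove at most 2 pods...
    periodSeconds: 60     # ...per 60-second period

If several policies are listed for the same direction, the one allowing the largest change is used by default.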
Using Custom and External Metrics (Conceptual Overview)
In more advanced setups, HPA can scale based on custom or external metrics via metrics adapters. Without going into installation details:
- Custom application metrics example
- Scale based on http_requests_per_second exposed by your app.
- External metrics example
- Scale based on queue_length in a message queue system.
A conceptual metric definition within an HPA might look like:
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "50"

or
metrics:
- type: External
  external:
    metric:
      name: job_queue_length
    target:
      type: AverageValue
      averageValue: "100"

These require that:
- The metric is exposed in a compatible format (often Prometheus).
- A metrics adapter is configured in the cluster to translate between the monitoring system and Kubernetes’ metrics API.
Workload Considerations and Patterns
Not all workloads are good candidates for HPA. Consider:
- Stateless services
- Typically ideal: web frontends, stateless APIs, microservices.
- Stateful services
- May be more constrained; scaling often requires data rebalancing, coordination, or similar work.
- HPA can sometimes be used carefully in combination with other mechanisms.
- Long-running jobs
- For batch or queue-based workloads, autoscaling on queue depth or job count can be effective.
- Initialization and warm‑up time
- If pods take a long time to become ready, HPA may lag behind spikes. Over-provisioning minReplicas or combining HPA with pre-warming strategies can help.
Also consider:
- Startup and readiness probes
- Ensure probes are correctly set so that newly scaled pods don’t receive traffic before they’re ready.
- Graceful shutdown
- When scaling down, ensure your application can drain and shutdown without losing work or data.
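Both of these concerns live in the pod template rather than in the HPA object. A minimal sketch, assuming a hypothetical HTTP service that exposes a /healthz endpoint on port 8080:

spec:
  terminationGracePeriodSeconds: 30        # time allowed to drain in-flight work on scale-down
  containers:
  - name: web-frontend
    image: example.com/web-frontend:latest # hypothetical image
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz                     # hypothetical readiness endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]          # short pause so endpoints are removed before shutdown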
HPA and Other Scaling Mechanisms
HPA operates at the pod replica level and interacts with other scaling features:
- Cluster autoscaler / node scaling
- HPA may increase pod replicas beyond current cluster capacity.
- A separate mechanism (e.g., cluster autoscaler) can then add nodes to satisfy the new demand.
- Vertical pod autoscaling
- HPA changes the number of pods.
- Vertical mechanisms adjust resource requests for each pod.
- Combining both requires careful tuning to avoid conflicts.
- Manual scaling
- If you manually change spec.replicas on the target controller, HPA will override it on its next reconciliation unless you disable or delete the HPA.
In OpenShift environments, teams often:
- Start with manual scaling.
- Introduce HPA on critical or user-facing services.
- Later combine HPA with cluster autoscaling to achieve end‑to‑end elasticity.
Observability and Troubleshooting HPA
To operate HPA effectively, you must be able to see what it is doing and why.
Useful checks:
- HPA status
- oc describe hpa <name> for events and current metrics values.
- Pod metrics
- Validate that the metrics seen by HPA match your expectations.
- Workload logs
- Check if pods under high utilization correspond to user-visible slowdowns.
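A few read-only commands are usually enough to cross-check what HPA is seeing (the names reuse the earlier web-frontend example):

oc describe hpa web-frontend-hpa      # targets, current values, and recent scaling events
oc adm top pods -n my-app             # per-pod CPU/memory from the cluster metrics pipeline
oc get hpa web-frontend-hpa -o yaml   # full status, including conditions such as AbleToScale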
Common symptoms and their likely causes:
- HPA never scales up
- Metrics not available or not recognized (e.g., missing metrics adapter).
- Target utilization too high compared to actual load.
- Resource requests set too large, so utilization stays low.
- HPA constantly scales up and down
- Target metric too sensitive for bursty traffic.
- Stabilization window too short or not configured.
- Metrics noisy; consider smoothing or different metrics.
- Pods scaled up but performance still poor
- Bottleneck is elsewhere (database, external dependency, network).
- Application not horizontally scalable (e.g., shared state, locking).
Instrumenting your application with appropriate metrics, and reviewing both HPA status and application performance, is key to tuning autoscaling.
Best Practices for Horizontal Pod Autoscaling in OpenShift
- Begin with conservative bounds
- Choose safe minReplicas to handle baseline traffic.
- Set maxReplicas to a value your infrastructure can realistically support.
- Align metrics with real performance
- Use CPU as a starting point, but move to application-level metrics where possible.
- Tune based on real data
- Measure typical and peak load, then adjust:
- Resource requests/limits
- Target utilization
- Min/max replicas
- Avoid overreaction to short spikes
- Use stabilization windows and appropriate target values.
- Ensure graceful scaling
- Correct probes.
- Graceful shutdown behavior.
- Idempotent request handling.
- Validate in non-production environments
- Test autoscaling under synthetic load to confirm behavior before enabling in production.
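One simple way to generate such load against an HTTP workload is a temporary pod that loops requests at it; the service name and port below are hypothetical and should be replaced with your own Service or Route:

oc run load-generator --rm -i --tty --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://web-frontend:8080; done"

While it runs, oc get hpa --watch shows whether replicas rise and settle as expected.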
By combining these practices with OpenShift’s monitoring and logging capabilities, you can build applications that respond automatically and predictably to changing load using horizontal pod autoscaling.