Concept and Goals of Horizontal Pod Autoscaling
Horizontal Pod Autoscaling (HPA) in OpenShift automatically adjusts the number of pod replicas for a workload based on observed metrics. Instead of manually scaling a Deployment/DeploymentConfig up or down, HPA continuously evaluates metrics such as CPU or custom application metrics and changes replica counts accordingly.
Key characteristics:
- Scales horizontally by changing pod replica count, not pod size or node size.
- Works with controllers that manage replicas (e.g., Deployment, DeploymentConfig, and in some cases StatefulSet).
- Reacts to metrics over time, not to single spikes.
- Enforces minimum and maximum replica limits for safety and cost control.
HPA is best suited for:
- Stateless or pseudo-stateless workloads.
- Applications where performance correlates reasonably with a simple metric (CPU, memory, request rate, queue length, etc.).
- Scenarios with predictable spikes (e.g., daytime traffic) or unpredictable but metrics-driven load.
How HPA Works in OpenShift
At a high level, HPA involves:
- Metrics collection
- A metrics stack (e.g., OpenShift’s built‑in metrics, cluster monitoring) gathers pod and/or custom metrics.
- HPA controller
- A control loop (in the Kubernetes control plane) periodically checks metrics against the scaling rules defined in the HPA object.
- Replica adjustment
- The controller calculates the desired replica count and updates the target controller (Deployment, DeploymentConfig, etc.).
- Workload reconciliation
- The target controller creates or removes pods to reach the desired number of replicas.
The HPA controller uses the observed metrics and a target value to compute the desired replicas. For resource metrics like CPU, a typical formula is:
$$
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetric}}{\text{targetMetric}} \right\rceil
$$
where the result is rounded up to the next whole number and then clamped between minReplicas and maxReplicas.
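For example, if 4 replicas are currently averaging 90% CPU utilization against a 70% target:
$$
\text{desiredReplicas} = \left\lceil 4 \times \frac{90}{70} \right\rceil = \lceil 5.14 \rceil = 6
$$
so the workload would be scaled to 6 replicas, provided maxReplicas allows it.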
Supported Metrics Types
OpenShift’s HPA uses the same metric types as upstream Kubernetes, but the actual availability of each depends on cluster configuration.
Common categories:
- Resource metrics
- Built-in metrics such as CPU and memory usage per pod.
- Typically available if the cluster metrics pipeline is enabled.
- Object metrics
- Metrics describing another Kubernetes object (e.g., requests per second on a Service), made available through a metrics adapter; a conceptual sketch appears at the end of this section.
- External metrics
- Metrics that are not tied to Kubernetes objects (e.g., messages in a cloud message queue, external monitoring system metrics).
- Custom application metrics
- Application‑level metrics (such as requests per second, latency, backlog) exposed by the app and made available through a metrics adapter.
On many OpenShift clusters, you will most commonly start with CPU (and sometimes memory) metrics, and only later integrate custom or external metrics as your observability and metrics stack matures.
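For reference, an object metric definition could conceptually look like the sketch below. The metric name, the Route being described, and the target value are all hypothetical, and a metrics adapter must actually expose the metric for this to work:

metrics:
- type: Object
  object:
    metric:
      name: requests_per_second        # hypothetical metric exposed by a metrics adapter
    describedObject:
      apiVersion: route.openshift.io/v1
      kind: Route
      name: web-frontend               # hypothetical Route receiving the traffic
    target:
      type: Value
      value: "2000"                    # scale so the Route as a whole stays near 2000 req/s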
Defining an HPA Object
The HPA is defined as a standard Kubernetes resource (HorizontalPodAutoscaler) that references a scalable target and one or more metrics.
Basic fields:
- scaleTargetRef – which controller to scale:
- Kind: Deployment, DeploymentConfig, etc.
- Name: the object name.
- minReplicas – lower bound on replica count.
- maxReplicas – upper bound on replica count.
- metrics – list of metric sources and targets.
Example: CPU-based HPA for a Deployment:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend-hpa
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

This configuration:
- Ensures at least 2 and at most 10 replicas.
- Tries to keep average CPU utilization around 70% across all pods.
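An HPA can also track more than one metric at once: it computes a desired replica count for each metric and applies the highest. As a sketch, the metrics list above could be extended with a memory target; the 80% figure is purely illustrative and assumes realistic memory requests are set on the pods:

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Resource
  resource:
    name: memory                 # requires memory requests on the pods
    target:
      type: Utilization
      averageUtilization: 80     # illustrative target, not a recommendation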
Creating and Managing HPAs with `oc`
You can define HPAs either by manifest or via oc commands.
Creating a basic HPA
CPU-based HPA using oc:
oc autoscale deployment/web-frontend \
  --cpu-percent=70 \
  --min=2 \
  --max=10
For DeploymentConfig:
oc autoscale dc/my-api \
  --cpu-percent=60 \
  --min=1 \
  --max=8

This command generates an HPA resource for the chosen object in the current project/namespace.
Inspecting and describing HPAs
List HPAs:
oc get hpa

Check details and current scaling decisions:
oc describe hpa web-frontend-hpa

You’ll see information like:
- Current replicas and desired replicas
- Metrics and current values
- Events that show recent scaling actions
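To follow scaling decisions as they happen, the standard watch flag can be added to oc get (shown here against the earlier example HPA):

oc get hpa web-frontend-hpa --watch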
Updating and deleting HPAs
Update the HPA manifest (oc edit hpa web-frontend-hpa) or apply a changed YAML:
oc apply -f web-frontend-hpa.yaml

Delete an HPA (stopping autoscaling but not removing the workload):

oc delete hpa web-frontend-hpa

Interaction with Resource Requests and Limits
HPA behavior is strongly influenced by resource requests and limits:
- CPU-based autoscaling uses metrics relative to CPU requests when using averageUtilization.
- If requests are too small, CPU utilization may always look very high, causing over-scaling.
- If requests are too large, utilization may remain low, preventing scaling up even under load.
- Memory-based autoscaling uses memory metrics; note that memory usage can be less elastic than CPU and may not be ideal as the sole scaling trigger.
Best practice:
- Ensure your pods have realistic resources.requests (especially for CPU).
- Keep HPA target utilization and resource requests aligned with actual workload behavior (based on measurements, not guesses).
Example pod spec excerpt that HPA will rely on:
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
If averageUtilization is 70, HPA aims for about 70% of the CPU requests, not the limits.
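With the excerpt above, a target of 70 therefore means HPA tries to keep average usage near 0.70 × 200m = 140m of CPU per pod. If, for example, 3 replicas were averaging 180m (90% of requests), the earlier formula would give:
$$
\left\lceil 3 \times \frac{90}{70} \right\rceil = 4 \text{ replicas}
$$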
Scaling Behavior: Stabilization and Cooldown
HPA is not instantaneous; it has built‑in protections to avoid oscillation:
- Control loop period
- The HPA controller evaluates metrics periodically (every 15 seconds by default), not continuously.
- Scale up vs scale down behavior
- Scale up is usually faster to respond to increasing load.
- Scale down is often more conservative, using longer stabilization windows to avoid flapping.
- Stabilization windows
- HPA tracks recent recommendations; for downscaling, it may wait a configured period and use the highest recommendation in that window.
In autoscaling/v2, you can configure behavior more explicitly (if supported/enabled in your OpenShift version):
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60

This example:
- Limits scale-up to at most doubling the replicas per 60-second period (with a 60-second stabilization window).
- Limits scale-down to removing at most half of the replicas per 60-second period, with a 5-minute stabilization window.
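Policies can also be expressed as absolute pod counts instead of percentages. A small sketch (the value of 2 is illustrative) that caps scale-down at two pods per minute:

scaleDown:
  stabilizationWindowSeconds: 300
  policies:
  - type: Pods
    value: 2              # remove at most 2 pods...
    periodSeconds: 60     # ...per 60-second period

If several policies are listed for the same direction, the one allowing the largest change is used by default.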
Using Custom and External Metrics (Conceptual Overview)
In more advanced setups, HPA can scale based on custom or external metrics via metrics adapters. Without going into installation details:
- Custom application metrics example
- Scale based on http_requests_per_second exposed by your app.
- External metrics example
- Scale based on queue_length in a message queue system.
A conceptual metric definition within an HPA might look like:
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "50"

or
metrics:
- type: External
  external:
    metric:
      name: job_queue_length
    target:
      type: AverageValue
      averageValue: "100"

These require that:
- The metric is exposed in a compatible format (often Prometheus).
- A metrics adapter is configured in the cluster to translate between the monitoring system and Kubernetes’ metrics API.
Workload Considerations and Patterns
Not all workloads are good candidates for HPA. Consider:
- Stateless services
- Typically ideal: web frontends, stateless APIs, microservices.
- Stateful services
- May be more constrained; scaling often requires data rebalancing, coordination, or similar work.
- HPA can sometimes be used carefully in combination with other mechanisms.
- Long-running jobs
- For batch or queue-based workloads, autoscaling on queue depth or job count can be effective.
- Initialization and warm‑up time
- If pods take a long time to become ready, HPA may lag behind spikes. Over-provisioning minReplicas or combining HPA with pre-warming strategies can help.
Also consider:
- Startup and readiness probes
- Ensure probes are correctly set so that newly scaled pods don’t receive traffic before they’re ready.
- Graceful shutdown
- When scaling down, ensure your application can drain and shutdown without losing work or data.
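Both of these concerns live in the pod template rather than in the HPA object. A minimal sketch, assuming a hypothetical HTTP service that exposes a /healthz endpoint on port 8080:

spec:
  terminationGracePeriodSeconds: 30        # time allowed to drain in-flight work on scale-down
  containers:
  - name: web-frontend
    image: example.com/web-frontend:latest # hypothetical image
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz                     # hypothetical readiness endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]          # short pause so endpoints are removed before shutdown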
HPA and Other Scaling Mechanisms
HPA operates at the pod replica level and interacts with other scaling features:
- Cluster autoscaler / node scaling
- HPA may increase pod replicas beyond current cluster capacity.
- A separate mechanism (e.g., cluster autoscaler) can then add nodes to satisfy the new demand.
- Vertical pod autoscaling
- HPA changes the number of pods.
- Vertical mechanisms adjust resource requests for each pod.
- Combining both requires careful tuning to avoid conflicts.
- Manual scaling
- If you manually change spec.replicas on the target controller, HPA will override it on its next reconciliation unless you disable or delete the HPA.
In OpenShift environments, teams often:
- Start with manual scaling.
- Introduce HPA on critical or user-facing services.
- Later combine HPA with cluster autoscaling to achieve end‑to‑end elasticity.
Observability and Troubleshooting HPA
To operate HPA effectively, you must be able to see what it is doing and why.
Useful checks:
- HPA status
- oc describe hpa <name> for events and current metrics values.
- Pod metrics
- Validate that the metrics seen by HPA match your expectations.
- Workload logs
- Check if pods under high utilization correspond to user-visible slowdowns.
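A few read-only commands are usually enough to cross-check what HPA is seeing (the names reuse the earlier web-frontend example):

oc describe hpa web-frontend-hpa      # targets, current values, and recent scaling events
oc adm top pods -n my-app             # per-pod CPU/memory from the cluster metrics pipeline
oc get hpa web-frontend-hpa -o yaml   # full status, including conditions such as AbleToScale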
Common symptoms and their likely causes:
- HPA never scales up
- Metrics not available or not recognized (e.g., missing metrics adapter).
- Target utilization too high compared to actual load.
- Resource requests set too large, so utilization stays low.
- HPA constantly scales up and down
- Target metric too sensitive for bursty traffic.
- Stabilization window too short or not configured.
- Metrics noisy; consider smoothing or different metrics.
- Pods scaled up but performance still poor
- Bottleneck is elsewhere (database, external dependency, network).
- Application not horizontally scalable (e.g., shared state, locking).
Instrumenting your application with appropriate metrics, and reviewing both HPA status and application performance, is key to tuning autoscaling.
Best Practices for Horizontal Pod Autoscaling in OpenShift
- Begin with conservative bounds
- Choose safe minReplicas to handle baseline traffic.
- Set maxReplicas to a value your infrastructure can realistically support.
- Align metrics with real performance
- Use CPU as a starting point, but move to application-level metrics where possible.
- Tune based on real data
- Measure typical and peak load, then adjust:
- Resource requests/limits
- Target utilization
- Min/max replicas
- Avoid overreaction to short spikes
- Use stabilization windows and appropriate target values.
- Ensure graceful scaling
- Correct probes.
- Graceful shutdown behavior.
- Idempotent request handling.
- Validate in non-production environments
- Test autoscaling under synthetic load to confirm behavior before enabling in production.
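One simple way to generate such load against an HTTP workload is a temporary pod that loops requests at it; the service name and port below are hypothetical and should be replaced with your own Service or Route:

oc run load-generator --rm -i --tty --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://web-frontend:8080; done"

While it runs, oc get hpa --watch shows whether replicas rise and settle as expected.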
By combining these practices with OpenShift’s monitoring and logging capabilities, you can build applications that respond automatically and predictably to changing load using horizontal pod autoscaling.