Key Goals of Capacity Planning in OpenShift
Capacity planning in OpenShift is about ensuring that your cluster has enough compute, memory, storage, and network resources to meet current and future demands—without excessive over‑provisioning. In the context of operations, it connects:
- Workload requirements (applications, CI/CD pipelines, batch jobs)
- Platform overhead (control plane, Operators, monitoring, logging, registry)
- Service-level objectives (SLOs) (availability, latency, throughput, time‑to‑scale)
In practice, capacity planning in OpenShift is iterative and data-driven, not a one‑time sizing exercise.
Key objectives:
- Avoid resource saturation that leads to pod evictions, throttling, and failures.
- Provide headroom for spikes, upgrades, and node failures.
- Balance cost efficiency with reliability and performance.
- Align infrastructure growth with application and business growth.
Capacity Planning Dimensions in OpenShift
Capacity must be considered across several dimensions that interact with one another.
Compute (CPU and Memory)
- Node sizing: Choosing fewer large nodes vs. more small nodes affects:
- Failure blast radius (how much workload is lost when one node fails)
- Scheduling flexibility (bin-packing pods with different resource profiles)
- CPU vs memory ratio:
- Some clusters are CPU‑bound (e.g., API processing, microservices)
- Others are memory‑bound (e.g., caching, in‑memory analytics)
- Headroom:
- Always reserve some percentage of aggregate CPU/memory for:
- Control plane and system daemons
- Spikes and rescheduling during failures
- Upgrades and rolling node maintenance
- A common operational starting point is to target ~60–70% sustained utilization at peak.
Storage
- Capacity: Sizing total storage across:
- Application data volumes
- Internal registry
- Logging and metrics storage
- Backup and snapshot space
- Performance:
- IOPS and throughput needs for workloads (databases, message queues)
- Latency requirements for stateful services
- Classes of storage:
- Multiple StorageClass types with different performance and cost tiers
- Plan which workloads can use lower‑tier vs high‑performance storage
Network and Ingress
- Bandwidth and throughput:
- North–south: client traffic through Routes/Ingress/Load Balancers
- East–west: service‑to‑service traffic within the cluster
- Ingress capacity:
- Number and size of router/ingress pods
- External load balancer limits (connections, throughput)
- Multi‑zone routing and failover:
- Ensuring enough ingress capacity in each zone
- Planning for zone failure scenarios
High Availability and Failure Scenarios
Capacity must be sufficient not just for normal operation, but also under failure:
- Node failure: Can the remaining nodes run all critical pods when one or more nodes are lost?
- Zone failure:
- Are replicas and capacities distributed across availability zones?
- Is there enough spare capacity in surviving zones?
- Upgrade and maintenance windows:
- During rolling node upgrades, workloads are temporarily moved.
- Plan capacity so nodes can be drained without overloading others.
Inputs for Capacity Planning
Effective capacity planning combines design assumptions with real utilization data.
Workload Profiles
Gather or estimate:
- Number of applications and replicas
- Resource requests and limits per pod:
- CPU (cores or millicores) and memory
- Patterns of over‑ or under‑requesting
- Traffic patterns:
- Average vs peak RPS (requests per second)
- Daily/weekly seasonality
- Batch vs interactive traffic
- Growth expectations:
- Planned new services
- Expected user growth and data growth
- Upcoming campaigns, releases, or events
Platform Overhead and Operators
Account for:
- Control plane components (API servers, controllers, etcd)
- Core platform services:
- Ingress/routers
- Cluster DNS
- Monitoring and logging stack
- Internal image registry (if used)
- Third‑party and custom Operators:
- Each Operator may run multiple controllers and operand pods.
- Some Operators manage resource‑heavy services (DBs, caches, message brokers).
Platform overhead typically ranges from a few percent to 20–30% of cluster resources, depending on how many platform services are hosted inside the cluster.
SLOs and Business Constraints
Tie capacity to non‑technical constraints:
- Target availability (e.g., 99.9% vs 99.99%) affects redundancy requirements.
- Performance SLOs (latency, throughput) influence how much headroom you maintain.
- Budget and cost constraints define upper bounds on node counts and instance types.
- Compliance may require data separation (e.g., separate nodes/regions), impacting utilization.
Estimating Cluster Size
A structured estimation process helps you move from requirements to an initial cluster design. You refine it later using monitoring data.
Step 1: Normalize Application Requirements
For each workload, consider its steady‑state resource requests (not limits), then add some buffer.
Example for CPU:
$$
\text{Total\_CPU\_requests} = \sum_{i=1}^{N} (\text{CPU\_request\_per\_pod\_} i \times \text{replicas\_} i)
$$
Do the same for memory. Then add a headroom factor, e.g., 30%:
$$
\text{Total\_CPU\_with\_buffer} = \text{Total\_CPU\_requests} \times 1.3
$$
Repeat for memory. Make sure to separate:
- Critical workloads (must always run under failure)
- Best‑effort or batch workloads (can be throttled or paused)
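Step 1 can be sketched as a simple aggregation. The workload figures below are illustrative assumptions, not data from any real cluster:

```python
# Sketch: sum steady-state requests per workload, apply a headroom buffer,
# and keep critical workloads separate. All figures are illustrative.
from math import fsum

workloads = [
    # (name, cpu_request_per_pod_cores, mem_request_per_pod_gib, replicas, critical)
    ("api-frontend", 0.5, 1.0, 6, True),
    ("orders-db",    2.0, 8.0, 3, True),
    ("batch-report", 1.0, 4.0, 4, False),
]

HEADROOM = 1.3  # 30% buffer on top of raw requests

def totals(items):
    # Aggregate CPU and memory requests across all pods, then add headroom.
    cpu = fsum(c * r for _, c, _, r, _ in items)
    mem = fsum(m * r for _, _, m, r, _ in items)
    return cpu * HEADROOM, mem * HEADROOM

critical = [w for w in workloads if w[4]]
cpu_all, mem_all = totals(workloads)
cpu_crit, mem_crit = totals(critical)
print(f"All workloads:      {cpu_all:.1f} cores, {mem_all:.1f} GiB")
print(f"Critical workloads: {cpu_crit:.1f} cores, {mem_crit:.1f} GiB")
```

The critical-only total is what must still fit on surviving nodes during a failure; the full total drives normal-operation sizing.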
Step 2: Add Platform and System Overhead
Estimate capacity reserved for:
- System daemons and kubelets
- OpenShift core services
- Monitoring, logging, and Operators
You can model this as a flat baseline (e.g., X cores and Y GB per node) or as a percentage of total node resources. For planning, you might assume something like:
$$
\text{Usable\_node\_CPU} = \text{Node\_CPU\_total} \times 0.8
$$
leaving 20% for overhead and spikes. Adjust based on your environment.
Step 3: Derive Node Count and Types
Given a node type with certain capacity, e.g.:
- $C_{\text{node}}$: CPU cores per node
- $M_{\text{node}}$: memory per node
And given usable fractions $f_{\text{CPU}}$ and $f_{\text{Mem}}$, your effective per‑node capacity is:
$$
C_{\text{usable}} = C_{\text{node}} \times f_{\text{CPU}}
$$
$$
M_{\text{usable}} = M_{\text{node}} \times f_{\text{Mem}}
$$
Then:
$$
\text{Node\_count\_CPU} = \left\lceil \frac{\text{Total\_CPU\_with\_buffer}}{C_{\text{usable}}} \right\rceil
$$
$$
\text{Node\_count\_Mem} = \left\lceil \frac{\text{Total\_Mem\_with\_buffer}}{M_{\text{usable}}} \right\rceil
$$
Take the maximum of the two as a starting node count.
Next, apply HA and failure constraints. For example, in an N+1 scheme:
- Ensure that after losing one node, the remaining nodes still have enough capacity for all critical workloads.
- This may require an additional node or more headroom.
If using multiple availability zones, perform this sizing per zone.
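The node-count formulas above reduce to a few lines of arithmetic. Node capacity, usable fractions, and the buffered totals below are illustrative assumptions carried over from the Step 1 example:

```python
# Sketch: derive a starting node count from buffered totals and usable
# per-node capacity, then apply an N+1 failure allowance. Figures are illustrative.
from math import ceil

NODE_CPU, NODE_MEM = 16, 64      # cores and GiB per node (assumed node type)
F_CPU, F_MEM = 0.8, 0.8          # usable fractions after platform overhead

total_cpu_with_buffer = 16.9     # cores (example Step 1 output)
total_mem_with_buffer = 59.8     # GiB

c_usable = NODE_CPU * F_CPU      # effective CPU per node
m_usable = NODE_MEM * F_MEM      # effective memory per node

# Take the larger of the CPU-driven and memory-driven node counts.
nodes = max(ceil(total_cpu_with_buffer / c_usable),
            ceil(total_mem_with_buffer / m_usable))
nodes_n_plus_1 = nodes + 1       # survive the loss of one node
print(f"Base nodes: {nodes}, with N+1: {nodes_n_plus_1}")
```

In a multi-zone layout you would run this per zone, checking that the surviving zones can still hold all critical workloads.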
Step 4: Consider Multiple Node Pools and Special Hardware
Plan distinct machine sets / node pools for:
- General purpose workloads
- Compute‑optimized or memory‑optimized workloads
- Specialized nodes:
- GPU nodes for ML/AI or HPC
- Storage‑heavy nodes for data services
Each pool can be sized independently, based on the subset of workloads that may land on that pool, and governed by labels and nodeSelector/affinity rules.
Using Quotas, Requests, and Limits for Capacity Management
Capacity planning is only effective when you control how workloads consume cluster resources.
Enforcing Resource Requests
- The scheduler uses requests, not limits, to decide placement.
- Encourage or enforce realistic requests via:
- Default LimitRange for namespaces
- Admission policies that reject pods without requests
- Under‑requested workloads can cause noisy neighbor problems and unpredictable performance.
Namespaces, Quotas, and Fair Sharing
Use ResourceQuota and LimitRange to align tenant expectations with cluster capacity:
- Set per‑project caps for:
- Total CPU/memory requests and limits
- Number of pods
- PersistentVolumeClaims
- Reserve capacity for critical namespaces and platform services.
- Prevent a single team from consuming all available resources.
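One way to keep per-project quotas honest is to check that their sum fits within usable cluster capacity. A minimal sketch, with hypothetical quota values and capacity figures:

```python
# Sketch: verify that the sum of per-namespace ResourceQuota caps
# (requests.cpu / requests.memory) fits in usable cluster capacity.
# All quota and capacity values are hypothetical.
quotas = {
    "team-a":            {"cpu": 8,  "mem_gib": 32},
    "team-b":            {"cpu": 12, "mem_gib": 48},
    "platform-reserved": {"cpu": 4,  "mem_gib": 16},
}
USABLE_CPU, USABLE_MEM = 25.6, 102.4   # e.g., 2 nodes x 16 cores/64 GiB x 0.8

cpu_committed = sum(q["cpu"] for q in quotas.values())
mem_committed = sum(q["mem_gib"] for q in quotas.values())

print(f"CPU committed: {cpu_committed}/{USABLE_CPU} cores "
      f"({cpu_committed / USABLE_CPU:.0%} of usable)")
print(f"Mem committed: {mem_committed}/{USABLE_MEM} GiB "
      f"({mem_committed / USABLE_MEM:.0%} of usable)")
```

Whether you allow the committed total to exceed 100% (overcommit) is a policy choice: it raises utilization but relies on tenants not using their full quota simultaneously.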
Pod Disruption and Node Drains
During maintenance and failures, pods are rescheduled:
- Ensure enough spare capacity so that when nodes are drained (for upgrades or repairs), other nodes can absorb their workloads.
- Use Pod Disruption Budgets (PDBs) to control how many replicas can be unavailable, but recognize that PDBs require sufficient spare capacity to be effective.
Storage Capacity Planning
Storage is often a bottleneck if not planned carefully.
Sizing for Application and Platform Data
Identify and estimate:
- Persistent Volumes (PVs) for application data:
- Current usage and expected growth (per month/quarter)
- Metrics and logging:
- Retention period (e.g., 7 days vs 30 days) has a huge impact on storage.
- Sampling rate and cardinality (number of metrics/labels).
- Registry data:
- Number and size of images
- Image retention policies (pruning strategy)
Plan total capacity as:
$$
\text{Total\_storage} = \text{App\_data} + \text{Logs\_and\_metrics} + \text{Registry} + \text{Backups\_and\_snapshots} + \text{Buffer}
$$
Performance Classes and Workload Mapping
Assign workloads to appropriate StorageClass types:
- High IOPS/low latency for databases and message brokers.
- Standard performance for typical stateless services with occasional persistence.
- Lower‑tier or object storage for logs, archives, and backups.
Map expected IOPS and throughput to the underlying storage system’s capabilities, ensuring:
- You stay within provider or array limits.
- You understand how many volumes per node and per workload you can support.
Network and Ingress Capacity
Network planning focuses on bandwidth, latency, and connection limits.
Ingress and Router Scaling
Consider:
- Peak external request rate and bandwidth across all Routes/Ingress.
- Number of router/ingress pods and their resource requests.
- How routers are spread across nodes and zones.
To scale:
- Use horizontal scaling (more router pods) rather than overly large single pods.
- Use multiple ingress controllers (e.g., internal vs external traffic) with distinct capacity plans.
East–West Traffic and Service Mesh
If using a service mesh or heavy inter‑service traffic:
- Account for sidecar overhead (CPU/memory) on each pod.
- Consider additional network latency and throughput requirements.
- Measure and size based on typical and worst‑case call graphs.
Growth, Scaling Strategies, and Automation
Capacity planning should define not just current size, but how the cluster will grow.
Organic Growth vs Stepwise Expansion
- Organic growth:
- Add nodes incrementally as utilization approaches a threshold (e.g., 60–70%).
- Requires continuous monitoring and quick procurement or autoscaling.
- Stepwise expansion:
- Pre‑plan periodic expansions (e.g., quarterly) based on projected demand.
- Useful when procurement or change management processes are slow.
Cluster Autoscaling
If using cluster autoscaler with cloud‑based infrastructure:
- Define min/max node counts per machine pool.
- Ensure there is:
- Enough quota in the cloud provider.
- Budgetary allowance for possible high‑water marks.
- Align pod requests with autoscaling behavior:
- Overly large pods may block scaling if no node size can fit them.
- Many tiny pods may lead to fragmentation and inefficient bin packing.
Multi‑Cluster and Multi‑Tenant Strategies
For large environments:
- Consider multiple clusters to:
- Isolate environments (dev/test/prod, or different business units).
- Separate highly specialized workloads (e.g., GPU/HPC vs general apps).
- Capacity planning spans:
- Per‑cluster sizing
- Cross‑cluster routing and failover
- Shared services (central logging, monitoring, CI/CD) that may have their own capacity plans.
Monitoring‑Driven Capacity Planning
Capacity planning is not static; OpenShift’s observability stack is central to continuous improvement.
Key Metrics to Track
At cluster and node level:
- CPU and memory utilization (average, p95, p99)
- Pod scheduling failures or pending pods due to insufficient resources
- Node pressure conditions (memory, disk, PID)
- Evictions and OOM kills
- Storage usage per volume, per application, and per storage backend
- Ingress/egress traffic volumes and router saturation
Capacity Dashboards and Alerts
Use or build dashboards that show:
- Headroom trends over time (weeks/months)
- Forecasted dates when utilization will hit thresholds (e.g., 80%).
- Hot spots: nodes or namespaces regularly near saturation.
Set alerts for:
- Sustained high utilization beyond planned thresholds.
- Rapid growth in storage usage or network throughput.
- Frequent pod evictions or FailedScheduling events.
These signals feed back into:
- Decisions to add nodes or new clusters.
- Adjustments to resource requests/limits and quotas.
- Changes in logging/metrics retention.
Operational Practices and Review Cycles
Capacity planning is part of ongoing operations, not a one‑off exercise.
- Regular reviews (e.g., monthly or quarterly):
- Compare actual usage against planned capacity.
- Validate growth assumptions and adjust forecasts.
- Change impact assessments:
- Analyze the capacity impact of major application deployments, new Operators, or big configuration changes.
- Pre‑upgrade checks:
- Before major OpenShift upgrades, verify enough spare capacity to handle:
- Node drains and rolling restarts.
- Any additional resource needs of the new release.
- Documentation and communication:
- Document capacity assumptions, thresholds, and SLOs.
- Communicate with application teams about quotas and expected scaling behavior.
By treating capacity planning as a continuous, metrics‑driven process tightly integrated with OpenShift operations, you maintain a platform that is both reliable and cost‑efficient, and that can evolve predictably as workloads and business needs grow.