Key Goals of Capacity Planning in OpenShift
Capacity planning in OpenShift is about ensuring that your cluster has enough compute, memory, storage, and network resources to meet current and future demands—without excessive over‑provisioning. In the context of operations, it connects:
- Workload requirements (applications, CI/CD pipelines, batch jobs)
- Platform overhead (control plane, Operators, monitoring, logging, registry)
- Service-level objectives (SLOs) (availability, latency, throughput, time‑to‑scale)
In practice, capacity planning in OpenShift is iterative and data-driven, not a one‑time sizing exercise.
Key objectives:
- Avoid resource saturation that leads to pod evictions, throttling, and failures.
- Provide headroom for spikes, upgrades, and node failures.
- Balance cost efficiency with reliability and performance.
- Align infrastructure growth with application and business growth.
Capacity Planning Dimensions in OpenShift
Capacity must be considered across several dimensions that interact with one another.
Compute (CPU and Memory)
- Node sizing: Choosing fewer large nodes vs. more small nodes affects:
- Failure blast radius (how much workload is lost when one node fails)
- Scheduling flexibility (bin-packing pods with different resource profiles)
- CPU vs memory ratio:
- Some clusters are CPU‑bound (e.g., API processing, microservices)
- Others are memory‑bound (e.g., caching, in‑memory analytics)
- Headroom:
- Always reserve some percentage of aggregate CPU/memory for:
- Control plane and system daemons
- Spikes and rescheduling during failures
- Upgrades and rolling node maintenance
- A common operational starting point is to target ~60–70% sustained utilization at peak.
Storage
- Capacity: Sizing total storage across:
- Application data volumes
- Internal registry
- Logging and metrics storage
- Backup and snapshot space
- Performance:
- IOPS and throughput needs for workloads (databases, message queues)
- Latency requirements for stateful services
- Classes of storage:
- Multiple StorageClass types with different performance and cost tiers
- Plan which workloads can use lower‑tier vs high‑performance storage
Network and Ingress
- Bandwidth and throughput:
- North–south: client traffic through Routes/Ingress/Load Balancers
- East–west: service‑to‑service traffic within the cluster
- Ingress capacity:
- Number and size of router/ingress pods
- External load balancer limits (connections, throughput)
- Multi‑zone routing and failover:
- Ensuring enough ingress capacity in each zone
- Planning for zone failure scenarios
High Availability and Failure Scenarios
Capacity must be sufficient not just for normal operation, but also under failure:
- Node failure: Can the remaining nodes run all critical pods when one or more nodes are lost?
- Zone failure:
- Are replicas and capacities distributed across availability zones?
- Is there enough spare capacity in surviving zones?
- Upgrade and maintenance windows:
- During rolling node upgrades, workloads are temporarily moved.
- Plan capacity so nodes can be drained without overloading others.
Inputs for Capacity Planning
Effective capacity planning combines design assumptions with real utilization data.
Workload Profiles
Gather or estimate:
- Number of applications and replicas
- Resource requests and limits per pod:
- CPU (cores or millicores) and memory
- Patterns of over‑ or under‑requesting
- Traffic patterns:
- Average vs peak RPS (requests per second)
- Daily/weekly seasonality
- Batch vs interactive traffic
- Growth expectations:
- Planned new services
- Expected user growth and data growth
- Upcoming campaigns, releases, or events
Platform Overhead and Operators
Account for:
- Control plane components (API servers, controllers, etcd)
- Core platform services:
- Ingress/routers
- Cluster DNS
- Monitoring and logging stack
- Internal image registry (if used)
- Third‑party and custom Operators:
- Each Operator may run multiple controllers and operand pods.
- Some Operators manage resource‑heavy services (DBs, caches, message brokers).
Platform overhead typically ranges from a few percent to 20–30% of cluster resources, depending on how many platform services are hosted inside the cluster.
SLOs and Business Constraints
Tie capacity to non‑technical constraints:
- Target availability (e.g., 99.9% vs 99.99%) affects redundancy requirements.
- Performance SLOs (latency, throughput) influence how much headroom you maintain.
- Budget and cost constraints define upper bounds on node counts and instance types.
- Compliance may require data separation (e.g., separate nodes/regions), impacting utilization.
Estimating Cluster Size
A structured estimation process helps you move from requirements to an initial cluster design. You refine it later using monitoring data.
Step 1: Normalize Application Requirements
For each workload, consider its steady‑state resource requests (not limits), then add some buffer.
Example for CPU:
$$
\text{Total\_CPU\_requests} = \sum_{i=1}^{N} (\text{CPU\_request\_per\_pod\_} i \times \text{replicas\_} i)
$$
Do the same for memory. Then add a headroom factor, e.g., 30%:
$$
\text{Total\_CPU\_with\_buffer} = \text{Total\_CPU\_requests} \times 1.3
$$
Repeat for memory. Make sure to separate:
- Critical workloads (must always run under failure)
- Best‑effort or batch workloads (can be throttled or paused)
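Step 1 can be sketched as a simple aggregation. The workload figures below are illustrative assumptions, not data from any real cluster:

```python
# Sketch: sum steady-state requests per workload, apply a headroom buffer,
# and keep critical workloads separate. All figures are illustrative.
from math import fsum

workloads = [
    # (name, cpu_request_per_pod_cores, mem_request_per_pod_gib, replicas, critical)
    ("api-frontend", 0.5, 1.0, 6, True),
    ("orders-db",    2.0, 8.0, 3, True),
    ("batch-report", 1.0, 4.0, 4, False),
]

HEADROOM = 1.3  # 30% buffer on top of raw requests

def totals(items):
    # Aggregate CPU and memory requests across all pods, then add headroom.
    cpu = fsum(c * r for _, c, _, r, _ in items)
    mem = fsum(m * r for _, _, m, r, _ in items)
    return cpu * HEADROOM, mem * HEADROOM

critical = [w for w in workloads if w[4]]
cpu_all, mem_all = totals(workloads)
cpu_crit, mem_crit = totals(critical)
print(f"All workloads:      {cpu_all:.1f} cores, {mem_all:.1f} GiB")
print(f"Critical workloads: {cpu_crit:.1f} cores, {mem_crit:.1f} GiB")
```

The critical-only total is what must still fit on surviving nodes during a failure; the full total drives normal-operation sizing.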
Step 2: Add Platform and System Overhead
Estimate capacity reserved for:
- System daemons and kubelets
- OpenShift core services
- Monitoring, logging, and Operators
You can model this as a flat baseline (e.g., X cores and Y GB per node) or as a percentage of total node resources. For planning, you might assume something like:
$$
\text{Usable\_node\_CPU} = \text{Node\_CPU\_total} \times 0.8
$$
leaving 20% for overhead and spikes. Adjust based on your environment.
Step 3: Derive Node Count and Types
Given a node type with certain capacity, e.g.:
- $C_{\text{node}}$: CPU cores per node
- $M_{\text{node}}$: memory per node
And given usable fractions $f_{\text{CPU}}$ and $f_{\text{Mem}}$, your effective per‑node capacity is:
$$
C_{\text{usable}} = C_{\text{node}} \times f_{\text{CPU}}
$$
$$
M_{\text{usable}} = M_{\text{node}} \times f_{\text{Mem}}
$$
Then:
$$
\text{Node\_count\_CPU} = \left\lceil \frac{\text{Total\_CPU\_with\_buffer}}{C_{\text{usable}}} \right\rceil
$$
$$
\text{Node\_count\_Mem} = \left\lceil \frac{\text{Total\_Mem\_with\_buffer}}{M_{\text{usable}}} \right\rceil
$$
Take the maximum of the two as a starting node count.
Next, apply HA and failure constraints. For example, in an N+1 scheme:
- Ensure that after losing one node, the remaining nodes still have enough capacity for all critical workloads.
- This may require an additional node or more headroom.
If using multiple availability zones, perform this sizing per zone.
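The node-count formulas above reduce to a few lines of arithmetic. Node capacity, usable fractions, and the buffered totals below are illustrative assumptions carried over from the Step 1 example:

```python
# Sketch: derive a starting node count from buffered totals and usable
# per-node capacity, then apply an N+1 failure allowance. Figures are illustrative.
from math import ceil

NODE_CPU, NODE_MEM = 16, 64      # cores and GiB per node (assumed node type)
F_CPU, F_MEM = 0.8, 0.8          # usable fractions after platform overhead

total_cpu_with_buffer = 16.9     # cores (example Step 1 output)
total_mem_with_buffer = 59.8     # GiB

c_usable = NODE_CPU * F_CPU      # effective CPU per node
m_usable = NODE_MEM * F_MEM      # effective memory per node

# Take the larger of the CPU-driven and memory-driven node counts.
nodes = max(ceil(total_cpu_with_buffer / c_usable),
            ceil(total_mem_with_buffer / m_usable))
nodes_n_plus_1 = nodes + 1       # survive the loss of one node
print(f"Base nodes: {nodes}, with N+1: {nodes_n_plus_1}")
```

In a multi-zone layout you would run this per zone, checking that the surviving zones can still hold all critical workloads.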
Step 4: Consider Multiple Node Pools and Special Hardware
Plan distinct machine sets / node pools for:
- General purpose workloads
- Compute‑optimized or memory‑optimized workloads
- Specialized nodes:
- GPU nodes for ML/AI or HPC
- Storage‑heavy nodes for data services
Each pool can be sized independently, based on the subset of workloads that may land on that pool, and governed by labels and nodeSelector/affinity rules.
Using Quotas, Requests, and Limits for Capacity Management
Capacity planning is only effective when you control how workloads consume cluster resources.
Enforcing Resource Requests
- The scheduler uses requests, not limits, to decide placement.
- Encourage or enforce realistic requests via:
- Default LimitRange for namespaces
- Admission policies that reject pods without requests
- Under‑requested workloads can cause noisy neighbor problems and unpredictable performance.
Namespaces, Quotas, and Fair Sharing
Use ResourceQuota and LimitRange to align tenant expectations with cluster capacity:
- Set per‑project caps for:
- Total CPU/memory requests and limits
- Number of pods
- PersistentVolumeClaims
- Reserve capacity for critical namespaces and platform services.
- Prevent a single team from consuming all available resources.
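One way to keep per-project quotas honest is to check that their sum fits within usable cluster capacity. A minimal sketch, with hypothetical quota values and capacity figures:

```python
# Sketch: verify that the sum of per-namespace ResourceQuota caps
# (requests.cpu / requests.memory) fits in usable cluster capacity.
# All quota and capacity values are hypothetical.
quotas = {
    "team-a":            {"cpu": 8,  "mem_gib": 32},
    "team-b":            {"cpu": 12, "mem_gib": 48},
    "platform-reserved": {"cpu": 4,  "mem_gib": 16},
}
USABLE_CPU, USABLE_MEM = 25.6, 102.4   # e.g., 2 nodes x 16 cores/64 GiB x 0.8

cpu_committed = sum(q["cpu"] for q in quotas.values())
mem_committed = sum(q["mem_gib"] for q in quotas.values())

print(f"CPU committed: {cpu_committed}/{USABLE_CPU} cores "
      f"({cpu_committed / USABLE_CPU:.0%} of usable)")
print(f"Mem committed: {mem_committed}/{USABLE_MEM} GiB "
      f"({mem_committed / USABLE_MEM:.0%} of usable)")
```

Whether you allow the committed total to exceed 100% (overcommit) is a policy choice: it raises utilization but relies on tenants not using their full quota simultaneously.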
Pod Disruption and Node Drains
During maintenance and failures, pods are rescheduled:
- Ensure enough spare capacity so that when nodes are drained (for upgrades or repairs), other nodes can absorb their workloads.
- Use Pod Disruption Budgets (PDBs) to control how many replicas can be unavailable, but recognize that PDBs require sufficient spare capacity to be effective.
Storage Capacity Planning
Storage is often a bottleneck if not planned carefully.
Sizing for Application and Platform Data
Identify and estimate:
- Persistent Volumes (PVs) for application data:
- Current usage and expected growth (per month/quarter)
- Metrics and logging:
- Retention period (e.g., 7 days vs 30 days) has a huge impact on storage.
- Sampling rate and cardinality (number of metrics/labels).
- Registry data:
- Number and size of images
- Image retention policies (pruning strategy)
Plan total capacity as:
$$
\text{Total\_storage} = \text{App\_data} + \text{Logs\_and\_metrics} + \text{Registry} + \text{Backups\_and\_snapshots} + \text{Buffer}
$$
Performance Classes and Workload Mapping
Assign workloads to appropriate StorageClass types:
- High IOPS/low latency for databases and message brokers.
- Standard performance for typical stateless services with occasional persistence.
- Lower‑tier or object storage for logs, archives, and backups.
Map expected IOPS and throughput to the underlying storage system’s capabilities, ensuring:
- You stay within provider or array limits.
- You understand how many volumes per node and per workload you can support.
Network and Ingress Capacity
Network planning focuses on bandwidth, latency, and connection limits.
Ingress and Router Scaling
Consider:
- Peak external request rate and bandwidth across all Routes/Ingress.
- Number of router/ingress pods and their resource requests.
- How routers are spread across nodes and zones.
To scale:
- Use horizontal scaling (more router pods) rather than overly large single pods.
- Use multiple ingress controllers (e.g., internal vs external traffic) with distinct capacity plans.
East–West Traffic and Service Mesh
If using a service mesh or heavy inter‑service traffic:
- Account for sidecar overhead (CPU/memory) on each pod.
- Consider additional network latency and throughput requirements.
- Measure and size based on typical and worst‑case call graphs.
Growth, Scaling Strategies, and Automation
Capacity planning should define not just current size, but how the cluster will grow.
Organic Growth vs Stepwise Expansion
- Organic growth:
- Add nodes incrementally as utilization approaches a threshold (e.g., 60–70%).
- Requires continuous monitoring and quick procurement or autoscaling.
- Stepwise expansion:
- Pre‑plan periodic expansions (e.g., quarterly) based on projected demand.
- Useful when procurement or change management processes are slow.
Cluster Autoscaling
If using cluster autoscaler with cloud‑based infrastructure:
- Define min/max node counts per machine pool.
- Ensure there is:
- Enough quota in the cloud provider.
- Budgetary allowance for possible high‑water marks.
- Align pod requests with autoscaling behavior:
- Overly large pods may block scaling if no node size can fit them.
- Many tiny pods may lead to fragmentation and inefficient bin packing.
Multi‑Cluster and Multi‑Tenant Strategies
For large environments:
- Consider multiple clusters to:
- Isolate environments (dev/test/prod, or different business units).
- Separate highly specialized workloads (e.g., GPU/HPC vs general apps).
- Capacity planning spans:
- Per‑cluster sizing
- Cross‑cluster routing and failover
- Shared services (central logging, monitoring, CI/CD) that may have their own capacity plans.
Monitoring‑Driven Capacity Planning
Capacity planning is not static; OpenShift’s observability stack is central to continuous improvement.
Key Metrics to Track
At cluster and node level:
- CPU and memory utilization (average, p95, p99)
- Pod scheduling failures or pending pods due to insufficient resources
- Node pressure conditions (memory, disk, PID)
- Evictions and OOM kills
- Storage usage per volume, per application, and per storage backend
- Ingress/egress traffic volumes and router saturation
Capacity Dashboards and Alerts
Use or build dashboards that show:
- Headroom trends over time (weeks/months)
- Forecasted dates when utilization will hit thresholds (e.g., 80%).
- Hot spots: nodes or namespaces regularly near saturation.
Set alerts for:
- Sustained high utilization beyond planned thresholds.
- Rapid growth in storage usage or network throughput.
- Frequent pod evictions or FailedScheduling events.
These signals feed back into:
- Decisions to add nodes or new clusters.
- Adjustments to resource requests/limits and quotas.
- Changes in logging/metrics retention.
Operational Practices and Review Cycles
Capacity planning is part of ongoing operations, not a one‑off exercise.
- Regular reviews (e.g., monthly or quarterly):
- Compare actual usage against planned capacity.
- Validate growth assumptions and adjust forecasts.
- Change impact assessments:
- Analyze the capacity impact of major application deployments, new Operators, or big configuration changes.
- Pre‑upgrade checks:
- Before major OpenShift upgrades, verify enough spare capacity to handle:
- Node drains and rolling restarts.
- Any additional resource needs of the new release.
- Documentation and communication:
- Document capacity assumptions, thresholds, and SLOs.
- Communicate with application teams about quotas and expected scaling behavior.
By treating capacity planning as a continuous, metrics‑driven process tightly integrated with OpenShift operations, you maintain a platform that is both reliable and cost‑efficient, and that can evolve predictably as workloads and business needs grow.