Core ideas of high availability
In the context of OpenShift (and Kubernetes in general), high availability (HA) is about designing the platform and applications so that they can tolerate failures with minimal disruption to users.
Key characteristics of a highly available system:
- Redundancy – multiple instances of critical components so one failure does not cause an outage.
- Fault isolation – failures are contained and do not cascade.
- Fast detection and recovery – failures are noticed quickly, and traffic is redirected or workloads are restarted automatically.
- No (or minimal) single points of failure (SPOF) – infrastructure is designed so that the failure of any one component does not take down the service.
In OpenShift, HA must be considered at multiple layers: infrastructure, platform components, and applications deployed on top.
Availability, reliability, and SLAs
It is useful to distinguish related concepts:
- Availability – proportion of time a service is up and functioning:
$$
\text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}}
$$
When expressed as a percentage, it leads to the common “number of nines” terminology:
- 99.0% – ~3.65 days of downtime/year
- 99.9% – ~8.76 hours of downtime/year
- 99.99% – ~52.6 minutes of downtime/year
- 99.999% – ~5.26 minutes of downtime/year
- Reliability – probability that a system will perform correctly for a given period under stated conditions.
- SLA / SLO – organizational targets (Service Level Agreements/Objectives) that define acceptable availability and performance.
HA design in OpenShift clusters aims to meet specific SLOs, e.g. “99.9% availability for critical applications.”
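For example, the 99.9% target above allows at most

$$
(1 - 0.999) \times 365 \times 24\ \text{h} = 8.76\ \text{hours of downtime per year},
$$

matching the “three nines” figure in the list.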
High availability building blocks in OpenShift
Many platform features are covered in other chapters; here we focus on how they contribute conceptually to HA.
Redundancy through replication
At different levels:
- Control plane – multiple API servers, etcd members, and controllers (in a production HA cluster) so that loss of a single node does not stop the cluster.
- Worker nodes – multiple nodes so workloads can be distributed and rescheduled in case of node failure.
- Pods – multiple replicas of an application behind a Service. One pod failing does not cause a full outage.
Conceptually, redundancy turns a potentially fatal component failure into a capacity reduction event.
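As a minimal sketch of pod-level redundancy (the name `web`, the image, and the port are placeholder assumptions), a Deployment requests several replicas and the platform maintains them:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                # hypothetical application name
spec:
  replicas: 3              # one pod failing is a capacity reduction, not an outage
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.example.com/web:1.0   # placeholder image
        ports:
        - containerPort: 8080
```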
Failover and leader election
HA often requires an active component and one or more standbys. In OpenShift/Kubernetes, this is typically implemented via:
- Leader election among control plane components and some Operators.
- Automatic failover where another replica or node takes over once the current leader fails or becomes unreachable.
From an HA perspective, the key idea is that failover must be automatic and fast enough to meet your SLOs.
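Kubernetes exposes leader election through Lease objects in the coordination.k8s.io API group: the active instance periodically renews the lease, and a standby takes over once it expires. The sketch below shows roughly what such a lease looks like; the names are hypothetical, and in practice these objects are created and renewed by client libraries rather than written by hand:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: example-operator-leader        # hypothetical lock name
  namespace: example-operator
spec:
  holderIdentity: example-operator-7d4f9c-abcde  # pod currently acting as leader
  leaseDurationSeconds: 15             # standbys may take over after this expires
  renewTime: "2025-01-01T12:00:00.000000Z"       # leader refreshes this regularly
```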
Load balancing as an HA mechanism
While load balancing is mainly about distributing traffic, it is also a core HA tool:
- Health checks – load balancers route traffic only to healthy endpoints.
- Multiple backends – if one pod or node fails, traffic automatically shifts to others.
- Multiple load balancer instances – to avoid the load balancer itself becoming a SPOF.
In OpenShift, HA typically involves external or cloud-provided load balancers plus Service objects and routing components.
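Inside the cluster, a Service load-balances across the ready endpoints of its selected pods, and an OpenShift Route exposes it through the router layer. A minimal sketch, reusing the hypothetical `web` application from above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web             # only pods passing their readiness probe receive traffic
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: web
spec:
  to:
    kind: Service
    name: web            # the router itself runs as multiple replicas for HA
```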
Self-healing as availability protection
Self-healing features discussed elsewhere (e.g. pod restarts, rescheduling) are central to HA:
- Pod restarts – if a pod crashes, it is restarted.
- Rescheduling – if a node disappears, workloads are moved elsewhere.
- Reconciliation loops – controllers continuously work to bring the actual state back to the desired state, reducing manual recovery actions.
From an HA perspective, self-healing transforms transient failures into short, often unnoticed, disruptions.
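For example, a liveness probe turns a hung process into an automatic restart. A container spec fragment (the `/healthz` endpoint and port are assumptions about the application):

```yaml
containers:
- name: web
  image: registry.example.com/web:1.0   # placeholder image
  livenessProbe:
    httpGet:
      path: /healthz                    # assumed health endpoint
      port: 8080
    initialDelaySeconds: 10             # allow time to start before probing
    periodSeconds: 5                    # probe every 5 seconds
    failureThreshold: 3                 # restart after ~15s of consecutive failures
```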
Avoiding single points of failure (SPOFs)
Designing for HA means systematically identifying and removing SPOFs at every layer.
Typical areas to consider in an OpenShift-based environment:
- Network – single router, single firewall, or single internet uplink can be SPOFs.
- Storage – single storage array or file server; if it fails, all dependent workloads fail.
- Control plane node – non-HA clusters with a single control plane node contain a SPOF by design.
- Critical Operators – a single instance managing vital resources can be a SPOF if not designed with redundancy.
- CI/CD and tooling – a single pipeline runner or artifact repository can be a SPOF in delivery processes.
Conceptually, HA design asks for at least two independent, tested paths for every critical function.
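Redundancy can also be eroded silently during voluntary disruptions such as node drains and upgrades. A PodDisruptionBudget guards against maintenance becoming a temporary SPOF; a sketch for the hypothetical `web` application:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # node drains may never leave fewer than 2 ready pods
  selector:
    matchLabels:
      app: web
```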
High availability across failure domains
To reason about HA, it is important to think in terms of failure domains – units that can fail together.
Common failure domains:
- Pod – process or container crash.
- Node – hardware failure, OS crash, or node-level misconfiguration.
- Rack / power domain – power failure or top-of-rack (ToR) switch outage.
- Availability zone (AZ) – datacenter or cloud zone failure.
- Region – large-scale geographic failure.
HA strategies differ by domain:
- Protecting against pod failures – use multiple pod replicas across nodes.
- Protecting against node failures – use multiple nodes and anti-affinity rules.
- Protecting against AZ failures – distribute nodes and workloads across zones.
- Protecting against region failures – use multiple clusters in different regions, with DNS or traffic management across them.
The core idea is distribution of critical components across independent failure domains, balanced with latency and cost.
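For pod placement specifically, topology spread constraints express this distribution declaratively. A pod template fragment, assuming the hypothetical `web` application from earlier and the standard well-known topology labels:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone   # spread across availability zones
  whenUnsatisfiable: DoNotSchedule           # zone balance is mandatory
  labelSelector:
    matchLabels:
      app: web
- maxSkew: 1
  topologyKey: kubernetes.io/hostname        # also spread across nodes
  whenUnsatisfiable: ScheduleAnyway          # node balance is best-effort
  labelSelector:
    matchLabels:
      app: web
```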
Application-level high availability concepts
Even with a highly available platform, applications must be designed for HA.
Key conceptual patterns:
- Stateless vs stateful – stateless services are much easier to make highly available; stateful ones require careful storage and replication strategies.
- Idempotent, retry-safe operations – clients can retry failed requests without causing incorrect behavior.
- Graceful degradation – partial failure leads to reduced functionality, not a total outage (e.g. read-only mode when writes are unavailable).
- Circuit breakers and timeouts – prevent cascading failures when one service becomes slow or unreachable.
- Health probes – readiness and liveness checks that accurately represent the application’s ability to serve traffic.
From an HA standpoint, application behavior under failure is as important as platform redundancy.
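A pod spec fragment sketching accurate health signaling and graceful shutdown (the `/ready` endpoint, port, and sleep duration are assumptions, and the image must contain a `sleep` binary):

```yaml
terminationGracePeriodSeconds: 30       # time allowed for in-flight work to finish
containers:
- name: web
  image: registry.example.com/web:1.0   # placeholder image
  readinessProbe:
    httpGet:
      path: /ready                      # assumed endpoint reporting ability to serve
      port: 8080
    periodSeconds: 5
    failureThreshold: 1                 # leave load balancing quickly when not ready
  lifecycle:
    preStop:
      exec:
        command: ["sleep", "5"]         # let endpoint updates propagate before SIGTERM
```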
Trade-offs in high availability design
High availability is not free; it involves trade-offs:
- Cost vs availability – more nodes, zones, clusters, and storage replication increase cost.
- Complexity vs reliability – sophisticated HA topologies can themselves introduce configuration errors.
- Consistency vs availability – especially for stateful workloads, stronger data consistency can conflict with higher availability (the trade-off captured by the CAP theorem).
- Performance vs distribution – spreading across zones/regions increases latency.
Conceptually, HA requires choosing an acceptable balance that aligns with business needs, rather than “maxing out” availability in every dimension.
HA patterns for OpenShift-based environments
While detailed implementation appears in other chapters, the main conceptual patterns are:
- N-way redundant control plane – multiple control plane nodes, typically distributed across zones.
- Redundant ingress layer – multiple routing/ingress components and external load balancers.
- Multi-node worker pools – workloads spread across nodes, possibly with separate pools for different types of workloads.
- Multi-AZ clusters – nodes in multiple availability zones with zone-aware scheduling.
- Multi-cluster topologies – active/active or active/passive across regions or datacenters for disaster-level HA.
Understanding these patterns conceptually helps you evaluate which level of HA is appropriate for a given application or environment.
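As one concrete illustration of the redundant ingress pattern, OpenShift's IngressController resource controls how many router replicas run. A sketch that scales the default controller (the replica count is an example, e.g. one per zone):

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 3              # e.g. one router pod per availability zone
```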
Measuring and validating availability
Conceptual HA design must be backed by measurement and testing:
- Uptime metrics – track actual availability against SLOs.
- Error rates and latency – availability is not just “up or down”; a degraded but technically reachable service may still violate SLOs.
- Failure drills (chaos testing) – intentionally removing nodes, zones, or components to verify that failover and recovery behave as expected.
- Runbooks and processes – documented procedures for handling failures that cannot be fully automated.
High availability is an ongoing practice, not a one-time configuration task; continuous validation is essential to keep real-world availability close to design goals.
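As a sketch of measurement, a Prometheus recording rule can track request-level availability against an SLO, assuming the application exports a conventional `http_requests_total` counter (the metric, `job` label, and rule names are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-slo
spec:
  groups:
  - name: web-availability
    rules:
    - record: web:availability:ratio_rate30d   # fraction of non-5xx responses
      expr: |
        sum(rate(http_requests_total{job="web",code!~"5.."}[30d]))
        /
        sum(rate(http_requests_total{job="web"}[30d]))
```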