Key concepts: zones, regions, and failure domains
In OpenShift (and Kubernetes), multi-zone and multi-region are about how you place cluster components across independent failure domains so that:
- A single physical failure (rack, power domain, data center, or region) does not take down the whole platform.
- Workloads can either survive failures transparently or fail over in a controlled way.
Basic terms used in OpenShift scheduling and infrastructure:
- Zone
A fault domain inside a region (e.g., AWS us-east-1a, GCP europe-west1-b). Zones usually share:
- Low-latency, high-bandwidth network
- Common control plane endpoints
But they have independent power, cooling, and physical infrastructure.
- Region
A geographically separated data center area (e.g., AWS us-east-1 vs us-west-2, on-prem “DC-East” vs “DC-West”). Regions typically have:
- Higher latency between each other
- Independent networking and power infrastructure
- Separate sets of zones
- Failure domain
A logical label representing a “thing that can fail together”. OpenShift commonly uses the topology.kubernetes.io/region and topology.kubernetes.io/zone labels.
These topology labels are attached to nodes (or machine sets) and are used for scheduling and spreading workloads.
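For illustration, the topology labels on a worker node look roughly like this (the node name and values are AWS-style placeholders; in cloud IPI installs they are set automatically by the installer and cloud provider):

```yaml
# Fragment of Node metadata showing the standard topology labels
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal      # placeholder node name
  labels:
    node-role.kubernetes.io/worker: ""
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a
```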
Multi-zone clusters
A multi-zone OpenShift cluster has worker (and in some modes, control plane) nodes spread across multiple availability zones within a single region.
Goals:
- Keep the cluster available if one zone fails.
- Spread workloads to avoid “all eggs in one basket”.
- Keep latency low (zones within a region usually have fast links).
Control plane placement in multi-zone setups
OpenShift requires an odd number of control plane nodes for etcd quorum (commonly 3).
Patterns:
- 3 control plane nodes, each in a different zone
- Recommended for cloud IPI deployments where infrastructure supports zonal distribution.
- If one zone fails, etcd still has 2/3 nodes and quorum is preserved.
- 3 control plane nodes all in one zone
- Simpler, but that zone is a single point of failure for the cluster.
- Sometimes used in on-prem or constrained scenarios, but not ideal for HA.
In multi-zone production environments, distributing control plane nodes across zones is the typical approach when supported.
Worker node distribution and machine sets
Workers are usually created using MachineSets (in IPI or when using Machine API). For multi-zone:
- One MachineSet per zone, with:
- Node labels like topology.kubernetes.io/region=<region-name> and topology.kubernetes.io/zone=<zone-name>
- Possibly additional labels to control workload placement (e.g., node-role.kubernetes.io/worker).
Example idea (just the concepts; a trimmed sketch follows below):
- machineset-worker-zone-a
- machineset-worker-zone-b
- machineset-worker-zone-c
Each MachineSet scales independently, so you can adjust capacity per zone.
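For concreteness, a trimmed MachineSet sketch for one zone, assuming an AWS cluster; the name, replica count, and zone are placeholders, and a real MachineSet also carries the installer-generated cluster labels plus a full providerSpec (AMI, instance type, subnet, and so on):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: worker-zone-a                  # placeholder; real names embed the infrastructure ID
  namespace: openshift-machine-api
spec:
  replicas: 2
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: worker-zone-a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-machineset: worker-zone-a
    spec:
      providerSpec:
        value:
          # Most cloud-specific fields omitted; the zone is pinned here (AWS example)
          placement:
            availabilityZone: us-east-1a
            region: us-east-1
```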
Topology-aware scheduling and spreading
To achieve resilience across zones, the cluster and workloads use:
- Pod anti-affinity / topology spread constraints
So replicas of the same application are distributed across zones rather than landing in a single one.
- Default topology spread (in newer Kubernetes/OpenShift versions)
The scheduler attempts to spread pods across zones and nodes, reducing skew without requiring per-app config.
Key mechanisms:
- topologySpreadConstraints:
- Spread replicas across topologyKey: topology.kubernetes.io/zone
- Possibly also topology.kubernetes.io/region
- PodDisruptionBudgets:
- Ensure operations (like node drains) don’t evict too many replicas in a given zone or overall.
Effect: if a whole zone becomes unavailable, remaining zones still run enough replicas (if you planned replica counts and resource capacity correctly).
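A minimal sketch of these two mechanisms together; the app name, image, and counts are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule     # use ScheduleAnyway for a softer constraint
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: image-registry.example.com/demo/web:latest   # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2          # drains may never take the app below 2 running replicas
  selector:
    matchLabels:
      app: web
```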
Storage in multi-zone clusters
Multi-zone + storage is where practical constraints show up clearly:
- Zonal storage (common case in cloud):
Persistent volumes are tied to a single zone:
- If a pod that needs that PV is scheduled onto a node in another zone, it likely cannot attach the volume.
- For stateful workloads with zone-local storage:
- Pods are effectively constrained to that same zone.
- You need node affinity or topology-aware provisioning to keep pods and PVs in the same zone (see the StorageClass sketch below).
- Regionally replicated storage:
Some storage backends provide:
- Synchronous or asynchronous replication across zones
- Volume classes that can be attached from any zone
- Higher cost, but better resilience.
On OpenShift, this is managed through:
- StorageClasses with:
- Zonal topology or
- Multi-zone/replicated capabilities.
- Dynamic provisioning that understands topology labels.
For HA designs, you must align:
- PVC access modes and storage topology
- Pod scheduling rules
- Expected zone-failure behavior (failover vs per-zone isolation)
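As an example of aligning these pieces, a StorageClass with delayed binding provisions the PV in whichever zone the consuming pod is scheduled into (a sketch; the provisioner and parameters assume the AWS EBS CSI driver and will differ for other backends):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-gp3                       # placeholder name
provisioner: ebs.csi.aws.com            # assumption: AWS EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer # bind after scheduling, in the pod's zone
reclaimPolicy: Delete
allowVolumeExpansion: true
```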
Load balancing and ingress in multi-zone environments
In a multi-zone cluster within a single region:
- A single cluster endpoint (e.g., cloud load balancer) usually fronts all API servers across zones.
- For applications:
- Ingress Controllers (using the OpenShift router) run on nodes in multiple zones (see the sketch below).
- The cloud load balancer (or external LB) spreads traffic across routers in different zones.
Considerations:
- If one zone fails:
- API load balancer fails over to remaining zones’ control plane nodes.
- Application ingresses continue to serve from routers in surviving zones.
- Health checks and LB config must:
- Detect unavailable nodes/routers.
- Avoid sending traffic to failed zones.
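As a sketch, giving the default IngressController one router replica per zone in a three-zone cluster is mostly a matter of the replica count (how evenly the router pods actually spread depends on node capacity and the scheduler):

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 3     # roughly one router per zone in a three-zone cluster
```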
Multi-region architectures with OpenShift
A multi-region design introduces higher latency, more complexity, and more independence between sites.
Core distinction:
- Multi-zone: usually one cluster across multiple zones within a region.
- Multi-region: usually multiple clusters, one per region, coordinated at a higher layer.
Running a single OpenShift cluster stretched across multiple regions is typically not recommended due to:
- etcd and control plane latency requirements
- Network reliability between regions
- Operational complexity and failure modes
Instead, common patterns are:
- Multiple clusters, one per region
- Workload-level mechanisms for:
- Traffic routing
- Data replication
- Failover and recovery
Multi-region topologies
Typical patterns:
- Active–active across regions
- At least two independent OpenShift clusters, each in its own region.
- Both clusters serve real user traffic.
- Data is replicated between regions (synchronous or asynchronous, depending on requirements).
- Global traffic is routed using:
- DNS-based routing (e.g., weighted or latency-based)
- Global load balancers
- Pros:
- High availability
- Lower latency for users near each region
- Cons:
- Complex data consistency and conflict resolution
- Cost (duplicate infrastructure)
- Active–passive (cold or warm standby)
- One primary region/cluster serves traffic (active).
- One or more secondary regions/clusters are on standby:
- Cold standby: minimal or no resources running; spin up on failover.
- Warm standby: scaled-down version of production running, can scale up quickly.
- Data is replicated from active to passive:
- Typically asynchronous to avoid impacting primary latency.
- Failover is:
- DNS or LB change to direct traffic to backup region.
- Requires tested runbooks or automation.
- Pros:
- Simpler data flows
- Less risk of data conflicts
- Cons:
- Failover time
- Possible data loss window (RPO > 0)
- Regional isolation with shared services
- Each region has its own “self-contained” OpenShift cluster for local workloads.
- Some shared platform services or data platforms may be global or regionally replicated.
- Often used in:
- Regulatory or data-sovereignty-driven environments
- Enterprises with region-specific workloads but shared tooling.
Cross-cluster coordination: traffic and service discovery
Since multi-region often means multi-cluster, you need a way to:
- Expose applications across regions
- Failover between clusters
- Potentially route users to the “closest” or “healthiest” region
Common mechanisms:
- DNS-based routing
- Multiple A/CNAME records pointing to region-specific ingress endpoints.
- Policies:
- Latency-based routing
- Geolocation-based routing
- Weighted distribution
- Failover (primary + secondary)
- Advantages: simple, widely supported.
- Drawbacks: DNS caching can delay failover.
- Global/Anycast load balancers
- Global front-end IP address that routes traffic to ingress endpoints in different regions.
- L7 health checks to route traffic away from unhealthy regions.
- Service mesh and multi-cluster gateways
- Mesh implementations can:
- Expose services across clusters
- Provide fine-grained traffic shifting (e.g., canary across regions)
- Useful for advanced scenarios but adds complexity.
In all cases, cluster-level ingress (OpenShift Routes/IngressControllers) operates per cluster; global routing layers sit above it.
Data management and replication
Data is usually the hardest part of multi-region design. Approaches:
- Stateless services
Easiest: deploy identical stateless workloads in each region; rely on:
- Shared global services (e.g., object storage)
- Or region-local data that is allowed to diverge
- Database-level replication
- Multi-region HA is pushed down to:
- Database replication (e.g., primary–replica, multi-primary, or cluster replication).
- OpenShift clusters host application pods that connect to these replicated databases.
- Choices affect:
- RPO (data loss tolerance)
- RTO (time to switch regions)
- Consistency guarantees (strong vs eventual).
- Storage-system replication
- Some storage systems provide:
- Region-to-region replication of volumes or object stores.
- Applications might:
- Fail over to replicated volumes
- Read from local copies and handle potential staleness.
Application design must be aware of:
- Whether a region might lose the most recent transactions on failover.
- How to handle data conflicts (in active-active).
- How to resync after a region comes back online.
Designing applications for multi-zone vs multi-region
The same application might behave differently depending on deployment pattern.
Multi-zone readiness
For multi-zone resilience, applications should:
- Run with enough replicas to survive a zone loss:
- Example: 3 zones → at least 3 replicas spread 1 per zone.
- Use topology spreading or anti-affinity (see the fragment after this list) to:
- Avoid all replicas landing in one zone.
- For stateful workloads:
- Understand storage topology:
- If PV is zone-local, ensure pods are scheduled in that zone.
- If using replicated/multi-zone storage, still consider performance impacts.
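The anti-affinity variant mentioned above is a fragment like this inside the pod template spec (the app label is a placeholder; preferred rather than required scheduling keeps pods schedulable even while a zone is down):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone   # spread across zones, not just nodes
          labelSelector:
            matchLabels:
              app: web
```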
Applications in multi-zone setups normally expect:
- Short network latencies between zones.
- Shared cluster control plane and API.
Multi-region readiness
For multi-region, applications should additionally:
- Be deployable in multiple clusters with:
- Same configuration templates (e.g., GitOps; see the overlay sketch after this list).
- Environment-specific values (endpoints, storage classes).
- Handle:
- Different latency profiles per region.
- Region-specific failures and partial outages.
- Be explicit about:
- Data ownership per region (e.g., sharding by geography vs full replication).
- Consistency requirements when switching regions.
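As an illustration of “same templates, environment-specific values”, a kustomize overlay per region/cluster might look like this (paths, resource names, and the host value are hypothetical):

```yaml
# overlays/eu-west/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # shared Deployment/Service/Route templates
patches:
  - target:
      kind: Route
      name: web
    patch: |-
      - op: replace
        path: /spec/host
        value: web.eu-west.example.com
```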
Patterns:
- Read local, write primary:
- Reads served in each region.
- Writes go to a single primary region.
- Failover involves promoting a new primary.
- Region-local write with eventual consistency:
- Regions can accept writes locally.
- Background replication synchronizes changes.
- Conflicts handled by app logic or database conflict resolution.
Operational aspects and trade-offs
Multi-zone and multi-region HA bring operational challenges that must be balanced with cost and complexity.
Cost vs availability
- Multi-zone:
- Additional nodes and LBs per zone, but still one cluster.
- Often considered a baseline for production cloud deployments.
- Multi-region:
- Doubling (or more) of infrastructure.
- Potential duplication of stateful services and platform components.
- Increased operational overhead (more clusters to manage).
Decision factors:
- Required SLA (uptime, RTO, RPO).
- Regulatory or latency-driven regional needs.
- Budget and team expertise.
Operational complexity
With multi-zone:
- Complexity mainly around:
- Node distribution
- Storage topology
- Ensuring workloads are properly spread
With multi-region:
- Additionally, you must manage:
- Multiple clusters’ lifecycle (upgrades, capacity, security).
- Global traffic routing strategy and automation.
- Data replication and failover procedures.
- Runbooks for partial-region failure, full-region failure, and “split-brain” scenarios.
Testing is crucial:
- Regularly simulate:
- Loss of a single zone
- Loss of control plane node(s)
- Entire region outage (in non-production or carefully controlled environments)
Summary of when to choose which design
- Single-region, multi-zone cluster:
- Preferred for:
- Most standard production workloads.
- Where regional outages are rare or an acceptable risk.
- Pros:
- Simpler operations.
- Single control plane.
- Good protection against many kinds of failures.
- Multi-region, multi-cluster setup:
- Consider when:
- You must survive total loss of a region.
- You need low-latency access from distant geographies.
- Regulations require regional segregation.
- Pros:
- Higher resilience against large-scale failures.
- Better per-region latency.
- Cons:
- Significant complexity and cost.
- Non-trivial data consistency challenges.
An effective OpenShift HA strategy usually starts with solid multi-zone design, and only extends to multi-region when business continuity, geography, or compliance requirements clearly justify the added complexity.