Key concepts: zones, regions, and failure domains
In OpenShift (and Kubernetes), multi-zone and multi-region are about how you place cluster components across independent failure domains so that:
- A single physical failure (rack, power domain, data center, or region) does not take down the whole platform.
- Workloads can either survive failures transparently or fail over in a controlled way.
Basic terms used in OpenShift scheduling and infrastructure:
- Zone
A fault domain inside a region (e.g., AWS us-east-1a, GCP europe-west1-b). Zones usually share:
- Low-latency, high-bandwidth network
- Common control plane endpoints
But they have independent power, cooling, and physical infrastructure.
- Region
A geographically separated data center area (e.g., AWS us-east-1 vs us-west-2, on-prem “DC-East” vs “DC-West”). Regions typically have:
- Higher latency between each other
- Independent networking and power infrastructure
- Separate sets of zones
- Failure domain
A logical label representing a “thing that can fail together”. OpenShift commonly uses the topology.kubernetes.io/region and topology.kubernetes.io/zone labels.
These topology labels are attached to nodes (or machine sets) and are used for scheduling and spreading workloads.
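For illustration, the topology labels on a worker node look roughly like this (the node name and values are AWS-style placeholders; in cloud IPI installs they are set automatically by the installer and cloud provider):

```yaml
# Fragment of Node metadata showing the standard topology labels
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal      # placeholder node name
  labels:
    node-role.kubernetes.io/worker: ""
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a
```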
Multi-zone clusters
A multi-zone OpenShift cluster has worker (and in some modes, control plane) nodes spread across multiple availability zones within a single region.
Goals:
- Keep the cluster available if one zone fails.
- Spread workloads to avoid “all eggs in one basket”.
- Keep latency low (zones within a region usually have fast links).
Control plane placement in multi-zone setups
OpenShift requires an odd number of control plane nodes for etcd quorum (commonly 3).
Patterns:
- 3 control plane nodes, each in a different zone
- Recommended for cloud IPI deployments where infrastructure supports zonal distribution.
- If one zone fails, etcd still has 2/3 nodes and quorum is preserved.
- 3 control plane nodes all in one zone
- Simpler, but that zone is a single point of failure for the cluster.
- Sometimes used in on-prem or constrained scenarios, but not ideal for HA.
In multi-zone production environments, distributing control plane nodes across zones is the typical approach when supported.
Worker node distribution and machine sets
Workers are usually created using MachineSets (in IPI or when using Machine API). For multi-zone:
- One MachineSet per zone, with:
- Node labels like topology.kubernetes.io/region=<region-name> and topology.kubernetes.io/zone=<zone-name>
- Possibly additional labels to control workload placement (e.g., node-role.kubernetes.io/worker).
Example idea (just the concepts; a trimmed sketch follows below):
- machineset-worker-zone-a
- machineset-worker-zone-b
- machineset-worker-zone-c
Each MachineSet scales independently, so you can adjust capacity per zone.
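For concreteness, a trimmed MachineSet sketch for one zone, assuming an AWS cluster; the name, replica count, and zone are placeholders, and a real MachineSet also carries the installer-generated cluster labels plus a full providerSpec (AMI, instance type, subnet, and so on):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: worker-zone-a                  # placeholder; real names embed the infrastructure ID
  namespace: openshift-machine-api
spec:
  replicas: 2
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: worker-zone-a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-machineset: worker-zone-a
    spec:
      providerSpec:
        value:
          # Most cloud-specific fields omitted; the zone is pinned here (AWS example)
          placement:
            availabilityZone: us-east-1a
            region: us-east-1
```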
Topology-aware scheduling and spreading
To achieve resilience across zones, the cluster and workloads use:
- Pod anti-affinity / topology spread constraints
So replicas of the same application are distributed across zones rather than landing in a single one.
- Default topology spread (in newer Kubernetes/OpenShift versions)
The scheduler attempts to spread pods across zones and nodes, reducing skew without requiring per-app config.
Key mechanisms:
- topologySpreadConstraints:
- Spread replicas across topologyKey: topology.kubernetes.io/zone
- Possibly also topology.kubernetes.io/region
- PodDisruptionBudgets:
- Ensure operations (like node drains) don’t evict too many replicas in a given zone or overall.
Effect: if a whole zone becomes unavailable, remaining zones still run enough replicas (if you planned replica counts and resource capacity correctly).
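A minimal sketch of these two mechanisms together; the app name, image, and counts are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule     # use ScheduleAnyway for a softer constraint
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: image-registry.example.com/demo/web:latest   # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2          # drains may never take the app below 2 running replicas
  selector:
    matchLabels:
      app: web
```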
Storage in multi-zone clusters
Multi-zone + storage is where practical constraints show up clearly:
- Zonal storage (common case in cloud):
Persistent volumes are tied to a single zone:
- If a pod that needs that PV is scheduled onto a node in another zone, it likely cannot attach the volume.
- For stateful workloads with zone-local storage:
- Pods are effectively constrained to that same zone.
- You need node affinity or topology-aware provisioning to keep pods and PVs in the same zone (see the StorageClass sketch below).
- Regionally replicated storage:
Some storage backends provide:
- Synchronous or asynchronous replication across zones
- Volume classes that can be attached from any zone
- Higher cost, but better resilience.
On OpenShift, this is managed through:
- StorageClasses with:
- Zonal topology or
- Multi-zone/replicated capabilities.
- Dynamic provisioning that understands topology labels.
For HA designs, you must align:
- PVC access modes and storage topology
- Pod scheduling rules
- Expected zone-failure behavior (failover vs per-zone isolation)
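As an example of aligning these pieces, a StorageClass with delayed binding provisions the PV in whichever zone the consuming pod is scheduled into (a sketch; the provisioner and parameters assume the AWS EBS CSI driver and will differ for other backends):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-gp3                       # placeholder name
provisioner: ebs.csi.aws.com            # assumption: AWS EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer # bind after scheduling, in the pod's zone
reclaimPolicy: Delete
allowVolumeExpansion: true
```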
Load balancing and ingress in multi-zone environments
In a multi-zone cluster within a single region:
- A single cluster endpoint (e.g., cloud load balancer) usually fronts all API servers across zones.
- For applications:
- Ingress Controllers (using the OpenShift router) run on nodes in multiple zones (see the sketch below).
- The cloud load balancer (or external LB) spreads traffic across routers in different zones.
Considerations:
- If one zone fails:
- API load balancer fails over to remaining zones’ control plane nodes.
- Application ingresses continue to serve from routers in surviving zones.
- Health checks and LB config must:
- Detect unavailable nodes/routers.
- Avoid sending traffic to failed zones.
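As a sketch, giving the default IngressController one router replica per zone in a three-zone cluster is mostly a matter of the replica count (how evenly the router pods actually spread depends on node capacity and the scheduler):

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 3     # roughly one router per zone in a three-zone cluster
```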
Multi-region architectures with OpenShift
A multi-region design introduces higher latency, more complexity, and more independence between sites.
Core distinction:
- Multi-zone: usually one cluster across multiple zones within a region.
- Multi-region: usually multiple clusters, one per region, coordinated at a higher layer.
Running a single OpenShift cluster stretched across multiple regions is typically not recommended due to:
- etcd and control plane latency requirements
- Network reliability between regions
- Operational complexity and failure modes
Instead, common patterns are:
- Multiple clusters, one per region
- Workload-level mechanisms for:
- Traffic routing
- Data replication
- Failover and recovery
Multi-region topologies
Typical patterns:
- Active–active across regions
- At least two independent OpenShift clusters, each in its own region.
- Both clusters serve real user traffic.
- Data is replicated between regions (synchronous or asynchronous, depending on requirements).
- Global traffic is routed using:
- DNS-based routing (e.g., weighted or latency-based)
- Global load balancers
- Pros:
- High availability
- Lower latency for users near each region
- Cons:
- Complex data consistency and conflict resolution
- Cost (duplicate infrastructure)
- Active–passive (cold or warm standby)
- One primary region/cluster serves traffic (active).
- One or more secondary regions/clusters are on standby:
- Cold standby: minimal or no resources running; spin up on failover.
- Warm standby: scaled-down version of production running, can scale up quickly.
- Data is replicated from active to passive:
- Typically asynchronous to avoid impacting primary latency.
- Failover is:
- DNS or LB change to direct traffic to backup region.
- Requires tested runbooks or automation.
- Pros:
- Simpler data flows
- Less risk of data conflicts
- Cons:
- Failover time
- Possible data loss window (RPO > 0)
- Regional isolation with shared services
- Each region has its own “self-contained” OpenShift cluster for local workloads.
- Some shared platform services or data platforms may be global or regionally replicated.
- Often used in:
- Regulatory or data-sovereignty-driven environments
- Enterprises with region-specific workloads but shared tooling.
Cross-cluster coordination: traffic and service discovery
Since multi-region often means multi-cluster, you need a way to:
- Expose applications across regions
- Failover between clusters
- Potentially route users to the “closest” or “healthiest” region
Common mechanisms:
- DNS-based routing
- Multiple A/CNAME records pointing to region-specific ingress endpoints.
- Policies:
- Latency-based routing
- Geolocation-based routing
- Weighted distribution
- Failover (primary + secondary)
- Advantages: simple, widely supported.
- Drawbacks: DNS caching can delay failover.
- Global/Anycast load balancers
- Global front-end IP address that routes traffic to ingress endpoints in different regions.
- L7 health checks to route traffic away from unhealthy regions.
- Service mesh and multi-cluster gateways
- Mesh implementations can:
- Expose services across clusters
- Provide fine-grained traffic shifting (e.g., canary across regions)
- Useful for advanced scenarios but adds complexity.
In all cases, cluster-level ingress (OpenShift Routes/IngressControllers) operates per cluster; global routing layers sit above it.
Data management and replication
Data is usually the hardest part of multi-region design. Approaches:
- Stateless services
Easiest: deploy identical stateless workloads in each region; rely on:
- Shared global services (e.g., object storage)
- Or region-local data that is allowed to diverge
- Database-level replication
- Multi-region HA is pushed down to:
- Database replication (e.g., primary–replica, multi-primary, or cluster replication).
- OpenShift clusters host application pods that connect to these replicated databases.
- Choices affect:
- RPO (data loss tolerance)
- RTO (time to switch regions)
- Consistency guarantees (strong vs eventual).
- Storage-system replication
- Some storage systems provide:
- Region-to-region replication of volumes or object stores.
- Applications might:
- Fail over to replicated volumes
- Read from local copies and handle potential staleness.
Application design must be aware of:
- Whether a region might lose the most recent transactions on failover.
- How to handle data conflicts (in active-active).
- How to resync after a region comes back online.
Designing applications for multi-zone vs multi-region
The same application might behave differently depending on deployment pattern.
Multi-zone readiness
For multi-zone resilience, applications should:
- Run with enough replicas to survive a zone loss:
- Example: 3 zones → at least 3 replicas spread 1 per zone.
- Use topology spreading or anti-affinity (see the fragment after this list) to:
- Avoid all replicas landing in one zone.
- For stateful workloads:
- Understand storage topology:
- If PV is zone-local, ensure pods are scheduled in that zone.
- If using replicated/multi-zone storage, still consider performance impacts.
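The anti-affinity variant mentioned above is a fragment like this inside the pod template spec (the app label is a placeholder; preferred rather than required scheduling keeps pods schedulable even while a zone is down):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone   # spread across zones, not just nodes
          labelSelector:
            matchLabels:
              app: web
```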
Applications in multi-zone setups normally expect:
- Short network latencies between zones.
- Shared cluster control plane and API.
Multi-region readiness
For multi-region, applications should additionally:
- Be deployable in multiple clusters with:
- Same configuration templates (e.g., GitOps; see the overlay sketch after this list).
- Environment-specific values (endpoints, storage classes).
- Handle:
- Different latency profiles per region.
- Region-specific failures and partial outages.
- Be explicit about:
- Data ownership per region (e.g., sharding by geography vs full replication).
- Consistency requirements when switching regions.
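As an illustration of “same templates, environment-specific values”, a kustomize overlay per region/cluster might look like this (paths, resource names, and the host value are hypothetical):

```yaml
# overlays/eu-west/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # shared Deployment/Service/Route templates
patches:
  - target:
      kind: Route
      name: web
    patch: |-
      - op: replace
        path: /spec/host
        value: web.eu-west.example.com
```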
Patterns:
- Read local, write primary:
- Reads served in each region.
- Writes go to a single primary region.
- Failover involves promoting a new primary.
- Region-local write with eventual consistency:
- Regions can accept writes locally.
- Background replication synchronizes changes.
- Conflicts handled by app logic or database conflict resolution.
Operational aspects and trade-offs
Multi-zone and multi-region HA bring operational challenges that must be balanced with cost and complexity.
Cost vs availability
- Multi-zone:
- Additional nodes and LBs per zone, but still one cluster.
- Often considered a baseline for production cloud deployments.
- Multi-region:
- Doubling (or more) of infrastructure.
- Potential duplication of stateful services and platform components.
- Increased operational overhead (more clusters to manage).
Decision factors:
- Required SLA (uptime, RTO, RPO).
- Regulatory or latency-driven regional needs.
- Budget and team expertise.
Operational complexity
With multi-zone:
- Complexity mainly around:
- Node distribution
- Storage topology
- Ensuring workloads are properly spread
With multi-region:
- Additionally, you must manage:
- Multiple clusters’ lifecycle (upgrades, capacity, security).
- Global traffic routing strategy and automation.
- Data replication and failover procedures.
- Runbooks for partial-region failure, full-region failure, and “split-brain” scenarios.
Testing is crucial:
- Regularly simulate:
- Loss of a single zone
- Loss of control plane node(s)
- Entire region outage (in non-production or carefully controlled environments)
Summary of when to choose which design
- Single-region, multi-zone cluster:
- Preferred for:
- Most standard production workloads.
- Where regional outages are rare or an acceptable risk.
- Pros:
- Simpler operations.
- Single control plane.
- Good protection against many kinds of failures.
- Multi-region, multi-cluster setup:
- Consider when:
- You must survive total loss of a region.
- You need low-latency access from distant geographies.
- Regulations require regional segregation.
- Pros:
- Higher resilience against large-scale failures.
- Better per-region latency.
- Cons:
- Significant complexity and cost.
- Non-trivial data consistency challenges.
An effective OpenShift HA strategy usually starts with solid multi-zone design, and only extends to multi-region when business continuity, geography, or compliance requirements clearly justify the added complexity.