Core Concepts of Failover
Failover is the controlled handover of a workload from one component (node, service, link, data center) to another when the first becomes unavailable or unhealthy.
You design failover around three key questions:
- What fails? Node, service, storage, network path, data center.
- Who decides it failed? Cluster stack, monitoring agent, load balancer, external controller.
- Where does it go? Standby node, any available node, DR site, degraded mode on same node.
High availability (HA) clustering uses automated failover to turn single-node failures into brief service interruptions, not outages.
Two critical metrics shape your design:
- RTO (Recovery Time Objective) — acceptable downtime during failover.
- RPO (Recovery Point Objective) — acceptable data loss (seconds, minutes, none).
Your failover strategy is essentially about meeting RTO/RPO within cost and complexity limits.
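To make the two metrics concrete, here is a minimal sketch, assuming illustrative targets of 60 s RTO and 5 s RPO, of scoring a failover drill against them:

```python
# A toy check of drill results against targets; all numbers are illustrative.
RTO_TARGET_SECONDS = 60   # acceptable downtime
RPO_TARGET_SECONDS = 5    # acceptable window of lost writes

def meets_objectives(measured_downtime_s: float, replication_lag_s: float) -> bool:
    """True if a failover test stayed within both objectives."""
    return (measured_downtime_s <= RTO_TARGET_SECONDS
            and replication_lag_s <= RPO_TARGET_SECONDS)

print(meets_objectives(42.0, 2.0))   # True: 42 s downtime, 2 s of unreplicated writes
print(meets_objectives(90.0, 2.0))   # False: RTO target missed
```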
Active/Passive vs Active/Active Failover
Active/Passive
- One node (or site) is active, one is standby.
- The standby may be:
- Hot: fully up, ready to take over instantly.
- Warm: running but needs some work on failover (e.g., attach volumes, start services).
- Cold: powered off until needed (slowest, cheapest).
Pros:
- Simple mental model.
- Easy to reason about capacity and failover behavior.
- Often easier to guarantee consistency (only one active writer).
Cons:
- Standby capacity is underutilized.
- Failover paths must be tested regularly or bitrot will break them.
- Risk of configuration drift if not managed carefully.
Active/Active
- Multiple nodes share the workload concurrently.
- Failover is often just load redistribution when a node disappears.
- Some services can be scaled horizontally (web frontends); others need more complex replication (databases).
Pros:
- Uses hardware efficiently.
- Often faster recovery (side effect of load balancing).
- Can handle both failover and scale-out.
Cons:
- Harder to maintain data consistency.
- Requires load-aware components (load balancers, distributed locks, etc.).
- Complex failure modes: partial failures, split-brain-like conditions in shared state.
In practice, you might mix them: e.g., active/active stateless web tier, active/passive stateful database.
Failover Triggers and Detection
Failover starts with detection. Triggering too late increases downtime; triggering too early causes instability.
Typical Triggers
- Node failure
- No heartbeat.
- Host unreachable via management network.
- Service failure
- Process exits or hangs.
- Health check (HTTP, SQL, custom script) fails.
- Storage failure
- Filesystem unmounts or goes read-only.
- Volume or device disappears.
- Network failure
- Loss of default gateway or important routes.
- VIP (virtual IP) not reachable from peers.
- Site or power failure (for multi-DC setups).
Failure Detection Methods
- Heartbeats
- Regular messages over one or more networks.
- Loss of multiple consecutive heartbeats ⇒ suspect failure.
- Health checks
- Application-level checks (e.g., GET /healthz).
- Database queries such as SELECT 1.
- Watchdogs
- Hardware/software components that force a reboot if the node stops responding locally.
- External monitoring
- Load balancer, monitoring system, or orchestrator decides a node is unhealthy and drains/evicts it.
You almost never want to fail over on a single missed heartbeat. Use thresholds (e.g., three consecutive missed heartbeats) and multiple sources of truth to avoid flapping.
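A minimal sketch of such threshold-based detection, assuming a monitor polled once per heartbeat interval (the interval and threshold values are illustrative; real stacks such as Corosync tune this via token/heartbeat timeouts):

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats (illustrative)
MISS_THRESHOLD = 3         # consecutive misses before suspecting failure

class PeerMonitor:
    def __init__(self) -> None:
        self.last_seen = time.monotonic()
        self.missed = 0

    def record_heartbeat(self) -> None:
        """Call whenever a heartbeat arrives from the peer."""
        self.last_seen = time.monotonic()
        self.missed = 0

    def peer_suspected(self) -> bool:
        """Call once per interval; True only after several consecutive misses."""
        if time.monotonic() - self.last_seen > HEARTBEAT_INTERVAL:
            self.missed += 1
        return self.missed >= MISS_THRESHOLD
```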
Coordinating Resources During Failover
Failover is not just starting a service on another node. You must move or reassign associated resources consistently:
- IP addresses / VIPs
- Floating IPs that move between nodes.
- DNS records updated with low TTL in some architectures.
- Storage resources
- Mount/unmount shared disks or volumes.
- Switch primary role in replicated storage.
- Application services
- Stop on failed node (or fence it), start on new node.
- Restore configuration, environment, and secrets.
- Dependencies
- Start/stop resources in the right order: storage → database → app → IP/VIP.
Cluster stacks model this as resource groups with ordering and colocation constraints so the cluster can move the entire service atomically.
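A minimal sketch of that ordering idea, assuming each resource exposes start() and stop() (class and resource names are hypothetical; stacks like Pacemaker express the same thing declaratively):

```python
class Resource:
    def __init__(self, name: str) -> None:
        self.name = name

    def start(self) -> None:
        print(f"starting {self.name}")   # placeholder for the real action

    def stop(self) -> None:
        print(f"stopping {self.name}")

class ResourceGroup:
    """Start resources in dependency order, stop them in reverse order."""
    def __init__(self, resources: list[Resource]) -> None:
        self.resources = resources

    def start(self) -> None:
        for r in self.resources:             # storage -> database -> app -> VIP
            r.start()

    def stop(self) -> None:
        for r in reversed(self.resources):   # VIP -> app -> database -> storage
            r.stop()

group = ResourceGroup([
    Resource("shared-storage"),
    Resource("database"),
    Resource("application"),
    Resource("virtual-ip"),
])
group.start()   # bring the whole service up on the takeover node
```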
Avoiding Split-Brain
Split-brain occurs when two or more nodes believe they are primary for the same resource (especially storage or a stateful service). This is catastrophic: data divergence, corruption, client confusion.
Why It Happens
- Network partition isolates nodes from each other but they’re still up.
- Heartbeat network fails but production and storage networks are fine.
- Faulty fencing (or no fencing) allows multiple “masters” to continue operating.
Preventive Principles
1. Quorum
Use a quorum mechanism so only the majority partition can make changes:
- In an N-node cluster, you define a quorum rule, often "more than N/2 active and communicating nodes".
- Minority partitions go into standby / read-only / no-op mode.
- For even numbers of nodes, add tie-breakers:
- Quorum devices (e.g., a disk or external node).
- Third site or lightweight arbitrator.
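A minimal sketch of a strict-majority check, where votes can come from nodes or a tie-breaker device (all numbers are illustrative):

```python
def has_quorum(total_votes: int, partition_votes: int) -> bool:
    """A partition may run resources only with a strict majority of votes."""
    return partition_votes > total_votes // 2

# 3-node cluster split 2/1: only the two-node side keeps quorum.
print(has_quorum(3, 2))   # True
print(has_quorum(3, 1))   # False

# 2 nodes plus a quorum device = 3 votes total: the node that still reaches
# the device keeps quorum; the isolated node must stand down.
print(has_quorum(3, 2))   # node + device -> True
print(has_quorum(3, 1))   # isolated node -> False
```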
2. Fencing (STONITH)
Fencing ensures that a node that might be alive but unreachable is forcibly removed:
- Power fencing (power off via PDU, IPMI, etc.).
- Storage fencing (revoke disk access, SCSI reservations).
- Network fencing (cut switch ports, revoke VIP assignment).
The principle: “Better to kill a maybe-alive node than risk two primaries.”
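A minimal sketch of fence-before-takeover using IPMI power fencing (the BMC address and credentials are placeholders, and the exact fence command depends on your hardware and fence agent):

```python
import subprocess

def fence_node(bmc_address: str, user: str, password: str) -> bool:
    """Power off the suspect node via its BMC; True only on confirmed success."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_address,
         "-U", user, "-P", password, "chassis", "power", "off"],
        capture_output=True, text=True,
    )
    return result.returncode == 0

def take_over(bmc_address: str, user: str, password: str) -> None:
    # Treat an unreachable node as alive until fencing is confirmed.
    if not fence_node(bmc_address, user, password):
        raise RuntimeError("fencing failed; refusing to take over resources")
    # ...only now start the resource group on this node...
```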
3. Single-Writer Rules
For storage and some stateful services:
- Strictly one primary writer; secondaries are read-only or log-based replicas.
- Role transitions (primary↔secondary) must be atomic and controlled by the cluster.
4. Independent Heartbeat Paths
- Multiple, redundant heartbeat channels (different NICs, switches, sometimes different media).
- Lower the chance that heartbeats fail while the node is otherwise healthy.
Planned vs Unplanned Failover
Unplanned Failover
Triggered by unexpected failure:
- Node crash, kernel panic, power outage.
- Application crash or hang.
- Network or hardware failure.
Characteristics:
- Must be automated.
- Focus on speed and correctness.
- Graceful shutdown may be impossible; rely on fencing and recovery procedures.
Planned Failover (Switchover)
You intentionally move workloads:
- Maintenance on hardware/OS.
- Upgrades, migrations, patching.
- Load balancing between sites.
Characteristics:
- Can be interactive or scripted.
- Give services time for graceful stop (flush caches, close connections).
- Use it to test your failover path.
Design your clusters so planned failover follows the same path as unplanned failover, just with more grace and logging.
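A minimal sketch of that shared path, where the two cases differ only in how the old node is taken out of service (stop_gracefully(), fence() and start() are hypothetical helpers):

```python
def fail_over(service, old_node, new_node, planned: bool) -> None:
    if planned:
        old_node.stop_gracefully(service)   # flush caches, close connections
    else:
        old_node.fence()                    # node may be dead or unreachable
    new_node.start(service)                 # identical startup path either way
```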
Stateful vs Stateless Failover
Failover principles differ sharply between stateless and stateful components.
Stateless Services
Examples: HTTP frontends, API gateways, some microservices.
- No significant state kept locally.
- Requests can go to any instance.
- Failover usually happens at:
- Load balancer level: stop sending traffic to failed node.
- Service discovery: unhealthy instances removed from registry.
Key consequences:
- Mostly active/active.
- RPO is effectively zero (no data to lose).
- RTO is often bounded by health-check intervals and load-balancer reaction time.
Stateful Services
Examples: databases, message queues, stateful monoliths.
Challenges:
- Must preserve consistency and durability.
- Usually require one primary writer; failover includes role change.
- Risk of data loss depending on replication mode:
- Synchronous replication: lower RPO, higher latency.
- Asynchronous replication: better latency, but can lose recent data on failover.
Key principles:
- Failover steps generally include:
- Confirm old primary is dead or fenced.
- Choose best candidate secondary (most up-to-date).
- Promote secondary to primary.
- Redirect traffic (VIP, connection strings, service discovery).
- Reintegrate old primary as a new secondary once safe.
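A minimal sketch of the selection-and-promotion step, assuming replica objects expose their replayed WAL position and a promote() method (hypothetical; tools such as Patroni or repmgr wrap this logic for PostgreSQL):

```python
def promote_best_secondary(secondaries, old_primary_fenced: bool):
    if not old_primary_fenced:
        raise RuntimeError("refusing to promote: old primary not confirmed down")
    # Pick the replica that has replayed the most of the primary's WAL.
    best = max(secondaries, key=lambda replica: replica.replayed_lsn)
    best.promote()
    return best   # caller then repoints the VIP / connection strings here
```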
Network Identity and Client Transparency
To make failover invisible to clients, or at least minimally disruptive, you manage the address they connect to rather than which node serves it.
Common strategies:
- Virtual IP (VIP):
- A shared IP address moved by cluster software between nodes.
- ARP announcements or gratuitous ARPs update switches/peers.
- DNS-based failover:
- Update DNS records to point to new node or site.
- Relies on low TTL or DNS-based load balancing services.
- Not ideal for very fast failover due to caching.
- Load balancers / reverse proxies:
- Frontend address stays the same.
- Backends are added/removed based on health checks.
- Service discovery systems (e.g., Consul, etcd-based tooling):
- Clients query service registry for available endpoints.
- Registry updates in response to node health.
Principle: abstract service identity from physical node identity.
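A minimal sketch of claiming a VIP and announcing it with gratuitous ARP, assuming Linux ip and iputils arping (the address, prefix, and interface are placeholders; in practice a resource agent such as IPaddr2 or keepalived does this for you):

```python
import subprocess

VIP = "192.0.2.10"    # placeholder floating address
PREFIX = "24"
IFACE = "eth0"

def claim_vip() -> None:
    # Attach the floating address to this node's interface...
    subprocess.run(["ip", "addr", "add", f"{VIP}/{PREFIX}", "dev", IFACE], check=True)
    # ...then send gratuitous ARP so switches and peers learn the new location.
    subprocess.run(["arping", "-U", "-I", IFACE, "-c", "3", VIP], check=True)

def release_vip() -> None:
    subprocess.run(["ip", "addr", "del", f"{VIP}/{PREFIX}", "dev", IFACE], check=True)
```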
Minimizing RTO and RPO
You use architecture and configuration to tune RTO and RPO.
Techniques to Reduce RTO
- Aggressive health-check and heartbeat intervals
- But avoid false positives (use multiple checks, thresholds).
- Hot standbys
- Pre-initialized instances with configuration and data ready.
- Automation
- No manual steps in the failover path.
- Fast storage and network
- Faster re-mount, shorter ARP convergence, quick promotion.
- Pre-warmed caches where relevant.
Techniques to Improve (Lower) RPO
- Synchronous or semi-synchronous replication
- Accept a write only after a replica acknowledges it.
- Frequent snapshots or log shipping
- More frequent intervals lower worst-case data loss.
- Write-ahead logging (WAL) streaming
- Continuous replication from primary to one or more secondaries.
- Careful failure criteria
- Avoid prematurely promoting a secondary that is actually far behind.
Always validate what RPO you achieve in reality, not just in theory, by testing failure scenarios.
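For example, a minimal sketch of measuring worst-case data loss on a PostgreSQL streaming replica (the DSN is a placeholder; other databases expose equivalent lag metrics):

```python
import psycopg2   # PostgreSQL driver; the query below is PostgreSQL-specific

REPLICA_DSN = "host=replica.example.internal dbname=app user=monitor"  # placeholder
RPO_TARGET_SECONDS = 5

def replication_lag_seconds() -> float:
    """Wall-clock age of the last transaction replayed on the replica."""
    with psycopg2.connect(REPLICA_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
            )
            lag = cur.fetchone()[0]
    return float(lag or 0.0)

if replication_lag_seconds() > RPO_TARGET_SECONDS:
    print("WARNING: worst-case data loss currently exceeds the RPO target")
```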
Testing and Validating Failover
Failover mechanisms are not trustworthy until they’re exercised under realistic conditions.
Types of Tests
- Planned failover drills
- Intentionally migrate services between nodes or sites.
- Failure injection
- Kill processes, unplug network cables, hard power off nodes.
- Disaster scenarios
- Simulate loss of an entire rack, switch, or site.
What to Measure
- Actual RTO during each scenario.
- Any data loss (RPO).
- Application behavior:
- Are errors properly surfaced or retried?
- Do clients recover automatically or need manual restart?
- Cluster decision-making:
- Was fencing executed properly?
- Any split-brain episodes or near misses?
- Were all dependencies moved correctly?
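A minimal sketch of one such drill: hard-kill the active instance and time how long the health endpoint stays down (the URL and unit name are placeholders; run it on the active node or wrap the kill in SSH):

```python
import subprocess
import time
import urllib.request

SERVICE_URL = "http://service.example.internal/healthz"   # placeholder endpoint
ACTIVE_UNIT = "myapp.service"                              # placeholder systemd unit

def service_is_up() -> bool:
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def measure_rto() -> float:
    # Inject the failure: SIGKILL the active instance, no graceful shutdown.
    subprocess.run(["systemctl", "kill", "-s", "SIGKILL", ACTIVE_UNIT], check=True)
    start = time.monotonic()
    while not service_is_up():
        time.sleep(0.5)
    return time.monotonic() - start

print(f"measured RTO: {measure_rto():.1f} s")
```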
Operational Readiness
- Clear runbooks documenting:
- How to trigger planned failover.
- How to verify success.
- How to roll back if needed.
- Logs and monitoring:
- Alerts on failover events.
- Metrics for failover frequency and duration.
Human and Operational Principles
The technology only works if operational practices support it.
Key principles:
- Simplicity over cleverness
- Minimal necessary nodes and roles.
- Avoid over-optimization that complicates failure modes.
- Consistency
- Use configuration management; avoid snowflake nodes.
- Separation of concerns
- Clear boundaries: what clustering controls vs what external tooling controls.
- Documented policies
- When to trigger manual failover.
- When not to fail over (e.g., an upstream outage where failover won't help).
- Post-mortems
- After every significant failover, analyze:
- Root cause.
- Whether failover worked as expected.
- Improvements to policies, thresholds, or architecture.
By applying these principles consistently, you move from ad-hoc “restart and hope” towards predictable, controlled failover that underpins real high availability.