Core Concepts of Failover
Failover is the controlled handover of a workload from one component (node, service, link, data center) to another when the first becomes unavailable or unhealthy.
You design failover around three key questions:
- What fails? Node, service, storage, network path, data center.
- Who decides it failed? Cluster stack, monitoring agent, load balancer, external controller.
- Where does it go? Standby node, any available node, DR site, degraded mode on same node.
High availability (HA) clustering uses automated failover to turn single-node failures into brief service interruptions, not outages.
Two critical metrics shape your design:
- RTO (Recovery Time Objective) — acceptable downtime during failover.
- RPO (Recovery Point Objective) — acceptable data loss (seconds, minutes, none).
Your failover strategy is essentially about meeting RTO/RPO within cost and complexity limits.
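To make the two metrics concrete, here is a minimal sketch, assuming illustrative targets of 60 s RTO and 5 s RPO, of scoring a failover drill against them:

```python
# A toy check of drill results against targets; all numbers are illustrative.
RTO_TARGET_SECONDS = 60   # acceptable downtime
RPO_TARGET_SECONDS = 5    # acceptable window of lost writes

def meets_objectives(measured_downtime_s: float, replication_lag_s: float) -> bool:
    """True if a failover test stayed within both objectives."""
    return (measured_downtime_s <= RTO_TARGET_SECONDS
            and replication_lag_s <= RPO_TARGET_SECONDS)

print(meets_objectives(42.0, 2.0))   # True: 42 s downtime, 2 s of unreplicated writes
print(meets_objectives(90.0, 2.0))   # False: RTO target missed
```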
Active/Passive vs Active/Active Failover
Active/Passive
- One node (or site) is active, one is standby.
- The standby may be:
- Hot: fully up, ready to take over instantly.
- Warm: running but needs some work on failover (e.g., attach volumes, start services).
- Cold: powered off until needed (slowest, cheapest).
Pros:
- Simple mental model.
- Easy to reason about capacity and failover behavior.
- Often easier to guarantee consistency (only one active writer).
Cons:
- Standby capacity is underutilized.
- Failover paths must be tested regularly or bitrot will break them.
- Risk of configuration drift if not managed carefully.
Active/Active
- Multiple nodes share the workload concurrently.
- Failover is often just load redistribution when a node disappears.
- Some services can be scaled horizontally (web frontends); others need more complex replication (databases).
Pros:
- Uses hardware efficiently.
- Often faster recovery (side effect of load balancing).
- Can handle both failover and scale-out.
Cons:
- Harder to maintain data consistency.
- Requires load-aware components (load balancers, distributed locks, etc.).
- Complex failure modes: partial failures, split-brain-like conditions in shared state.
In practice, you might mix them: e.g., active/active stateless web tier, active/passive stateful database.
Failover Triggers and Detection
Failover starts with detection. Triggering too late increases downtime; triggering too early causes instability.
Typical Triggers
- Node failure
- No heartbeat.
- Host unreachable via management network.
- Service failure
- Process exits or hangs.
- Health check (HTTP, SQL, custom script) fails.
- Storage failure
- Filesystem unmounts or goes read-only.
- Volume or device disappears.
- Network failure
- Loss of default gateway or important routes.
- VIP (virtual IP) not reachable from peers.
- Site or power failure (for multi-DC setups).
Failure Detection Methods
- Heartbeats
- Regular messages over one or more networks.
- Loss of multiple consecutive heartbeats ⇒ suspect failure.
- Health checks
- Application-level checks (e.g., GET /healthz).
- Database queries such as SELECT 1.
- Watchdogs
- Hardware/software components that force a reboot if the node stops responding locally.
- External monitoring
- Load balancer, monitoring system, or orchestrator decides a node is unhealthy and drains/evicts it.
You almost never want to fail over on a single missed heartbeat. Use thresholds (e.g., three consecutive missed heartbeats) and multiple sources of truth to avoid flapping.
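A minimal sketch of such threshold-based detection, assuming a monitor polled once per heartbeat interval (the interval and threshold values are illustrative; real stacks such as Corosync tune this via token/heartbeat timeouts):

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats (illustrative)
MISS_THRESHOLD = 3         # consecutive misses before suspecting failure

class PeerMonitor:
    def __init__(self) -> None:
        self.last_seen = time.monotonic()
        self.missed = 0

    def record_heartbeat(self) -> None:
        """Call whenever a heartbeat arrives from the peer."""
        self.last_seen = time.monotonic()
        self.missed = 0

    def peer_suspected(self) -> bool:
        """Call once per interval; True only after several consecutive misses."""
        if time.monotonic() - self.last_seen > HEARTBEAT_INTERVAL:
            self.missed += 1
        return self.missed >= MISS_THRESHOLD
```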
Coordinating Resources During Failover
Failover is not just starting a service on another node. You must move or reassign associated resources consistently:
- IP addresses / VIPs
- Floating IPs that move between nodes.
- DNS records updated with low TTL in some architectures.
- Storage resources
- Mount/unmount shared disks or volumes.
- Switch primary role in replicated storage.
- Application services
- Stop on failed node (or fence it), start on new node.
- Restore configuration, environment, and secrets.
- Dependencies
- Start/stop resources in the right order: storage → database → app → IP/VIP.
Cluster stacks model this as resource groups with ordering and colocation constraints so the cluster can move the entire service atomically.
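A minimal sketch of that ordering idea, assuming each resource exposes start() and stop() (class and resource names are hypothetical; stacks like Pacemaker express the same thing declaratively):

```python
class Resource:
    def __init__(self, name: str) -> None:
        self.name = name

    def start(self) -> None:
        print(f"starting {self.name}")   # placeholder for the real action

    def stop(self) -> None:
        print(f"stopping {self.name}")

class ResourceGroup:
    """Start resources in dependency order, stop them in reverse order."""
    def __init__(self, resources: list[Resource]) -> None:
        self.resources = resources

    def start(self) -> None:
        for r in self.resources:             # storage -> database -> app -> VIP
            r.start()

    def stop(self) -> None:
        for r in reversed(self.resources):   # VIP -> app -> database -> storage
            r.stop()

group = ResourceGroup([
    Resource("shared-storage"),
    Resource("database"),
    Resource("application"),
    Resource("virtual-ip"),
])
group.start()   # bring the whole service up on the takeover node
```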
Avoiding Split-Brain
Split-brain occurs when two or more nodes believe they are primary for the same resource (especially storage or a stateful service). This is catastrophic: data divergence, corruption, client confusion.
Why It Happens
- Network partition isolates nodes from each other but they’re still up.
- Heartbeat network fails but production and storage networks are fine.
- Faulty fencing (or no fencing) allows multiple “masters” to continue operating.
Preventive Principles
1. Quorum
Use a quorum mechanism so only the majority partition can make changes:
- In an N-node cluster, you define a quorum rule, often "more than N/2 active and communicating nodes".
- Minority partitions go into standby / read-only / no-op mode.
- For even numbers of nodes, add tie-breakers:
- Quorum devices (e.g., a disk or external node).
- Third site or lightweight arbitrator.
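A minimal sketch of a strict-majority check, where votes can come from nodes or a tie-breaker device (all numbers are illustrative):

```python
def has_quorum(total_votes: int, partition_votes: int) -> bool:
    """A partition may run resources only with a strict majority of votes."""
    return partition_votes > total_votes // 2

# 3-node cluster split 2/1: only the two-node side keeps quorum.
print(has_quorum(3, 2))   # True
print(has_quorum(3, 1))   # False

# 2 nodes plus a quorum device = 3 votes total: the node that still reaches
# the device keeps quorum; the isolated node must stand down.
print(has_quorum(3, 2))   # node + device -> True
print(has_quorum(3, 1))   # isolated node -> False
```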
2. Fencing (STONITH)
Fencing ensures that a node that might be alive but unreachable is forcibly removed:
- Power fencing (power off via PDU, IPMI, etc.).
- Storage fencing (revoke disk access, SCSI reservations).
- Network fencing (cut switch ports, revoke VIP assignment).
The principle: “Better to kill a maybe-alive node than risk two primaries.”
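A minimal sketch of fence-before-takeover using IPMI power fencing (the BMC address and credentials are placeholders, and the exact fence command depends on your hardware and fence agent):

```python
import subprocess

def fence_node(bmc_address: str, user: str, password: str) -> bool:
    """Power off the suspect node via its BMC; True only on confirmed success."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_address,
         "-U", user, "-P", password, "chassis", "power", "off"],
        capture_output=True, text=True,
    )
    return result.returncode == 0

def take_over(bmc_address: str, user: str, password: str) -> None:
    # Treat an unreachable node as alive until fencing is confirmed.
    if not fence_node(bmc_address, user, password):
        raise RuntimeError("fencing failed; refusing to take over resources")
    # ...only now start the resource group on this node...
```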
3. Single-Writer Rules
For storage and some stateful services:
- Strictly one primary writer; secondaries are read-only or log-based replicas.
- Role transitions (primary↔secondary) must be atomic and controlled by the cluster.
4. Independent Heartbeat Paths
- Multiple, redundant heartbeat channels (different NICs, switches, sometimes different media).
- Lower the chance that heartbeats fail while the node is otherwise healthy.
Planned vs Unplanned Failover
Unplanned Failover
Triggered by unexpected failure:
- Node crash, kernel panic, power outage.
- Application crash or hang.
- Network or hardware failure.
Characteristics:
- Must be automated.
- Focus on speed and correctness.
- Graceful shutdown may be impossible; rely on fencing and recovery procedures.
Planned Failover (Switchover)
You intentionally move workloads:
- Maintenance on hardware/OS.
- Upgrades, migrations, patching.
- Load balancing between sites.
Characteristics:
- Can be interactive or scripted.
- Give services time for graceful stop (flush caches, close connections).
- Use it to test your failover path.
Design your clusters so planned failover follows the same path as unplanned failover, just with more grace and logging.
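A minimal sketch of that shared path, where the two cases differ only in how the old node is taken out of service (stop_gracefully(), fence() and start() are hypothetical helpers):

```python
def fail_over(service, old_node, new_node, planned: bool) -> None:
    if planned:
        old_node.stop_gracefully(service)   # flush caches, close connections
    else:
        old_node.fence()                    # node may be dead or unreachable
    new_node.start(service)                 # identical startup path either way
```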
Stateful vs Stateless Failover
Failover principles differ sharply between stateless and stateful components.
Stateless Services
Examples: HTTP frontends, API gateways, some microservices.
- No significant state kept locally.
- Requests can go to any instance.
- Failover usually happens at:
- Load balancer level: stop sending traffic to failed node.
- Service discovery: unhealthy instances removed from registry.
Key consequences:
- Mostly active/active.
- RPO is effectively zero (no data to lose).
- RTO is often bounded by health-check intervals and load-balancer reaction time.
Stateful Services
Examples: databases, message queues, stateful monoliths.
Challenges:
- Must preserve consistency and durability.
- Usually require one primary writer; failover includes role change.
- Risk of data loss depending on replication mode:
- Synchronous replication: lower RPO, higher latency.
- Asynchronous replication: better latency, but can lose recent data on failover.
Key principles:
- Failover steps generally include:
- Confirm old primary is dead or fenced.
- Choose best candidate secondary (most up-to-date).
- Promote secondary to primary.
- Redirect traffic (VIP, connection strings, service discovery).
- Reintegrate old primary as a new secondary once safe.
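A minimal sketch of the selection-and-promotion step, assuming replica objects expose their replayed WAL position and a promote() method (hypothetical; tools such as Patroni or repmgr wrap this logic for PostgreSQL):

```python
def promote_best_secondary(secondaries, old_primary_fenced: bool):
    if not old_primary_fenced:
        raise RuntimeError("refusing to promote: old primary not confirmed down")
    # Pick the replica that has replayed the most of the primary's WAL.
    best = max(secondaries, key=lambda replica: replica.replayed_lsn)
    best.promote()
    return best   # caller then repoints the VIP / connection strings here
```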
Network Identity and Client Transparency
To make failover invisible to clients, or at least minimally disruptive, you manage the address they connect to rather than which node serves it.
Common strategies:
- Virtual IP (VIP):
- A shared IP address moved by cluster software between nodes.
- ARP announcements or gratuitous ARPs update switches/peers.
- DNS-based failover:
- Update DNS records to point to new node or site.
- Relies on low TTL or DNS-based load balancing services.
- Not ideal for very fast failover due to caching.
- Load balancers / reverse proxies:
- Frontend address stays the same.
- Backends are added/removed based on health checks.
- Service discovery systems (e.g., Consul, etcd-based tooling):
- Clients query service registry for available endpoints.
- Registry updates in response to node health.
Principle: abstract service identity from physical node identity.
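A minimal sketch of claiming a VIP and announcing it with gratuitous ARP, assuming Linux ip and iputils arping (the address, prefix, and interface are placeholders; in practice a resource agent such as IPaddr2 or keepalived does this for you):

```python
import subprocess

VIP = "192.0.2.10"    # placeholder floating address
PREFIX = "24"
IFACE = "eth0"

def claim_vip() -> None:
    # Attach the floating address to this node's interface...
    subprocess.run(["ip", "addr", "add", f"{VIP}/{PREFIX}", "dev", IFACE], check=True)
    # ...then send gratuitous ARP so switches and peers learn the new location.
    subprocess.run(["arping", "-U", "-I", IFACE, "-c", "3", VIP], check=True)

def release_vip() -> None:
    subprocess.run(["ip", "addr", "del", f"{VIP}/{PREFIX}", "dev", IFACE], check=True)
```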
Minimizing RTO and RPO
You use architecture and configuration to tune RTO and RPO.
Techniques to Reduce RTO
- Aggressive health-check and heartbeat intervals
- But avoid false positives (use multiple checks, thresholds).
- Hot standbys
- Pre-initialized instances with configuration and data ready.
- Automation
- No manual steps in the failover path.
- Fast storage and network
- Faster re-mount, shorter ARP convergence, quick promotion.
- Pre-warmed caches where relevant.
Techniques to Improve (Lower) RPO
- Synchronous or semi-synchronous replication
- Accept a write only after a replica acknowledges it.
- Frequent snapshots or log shipping
- More frequent intervals lower worst-case data loss.
- Write-ahead logging (WAL) streaming
- Continuous replication from primary to one or more secondaries.
- Careful failure criteria
- Avoid prematurely promoting a secondary that is actually far behind.
Always validate what RPO you achieve in reality, not just in theory, by testing failure scenarios.
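For example, a minimal sketch of measuring worst-case data loss on a PostgreSQL streaming replica (the DSN is a placeholder; other databases expose equivalent lag metrics):

```python
import psycopg2   # PostgreSQL driver; the query below is PostgreSQL-specific

REPLICA_DSN = "host=replica.example.internal dbname=app user=monitor"  # placeholder
RPO_TARGET_SECONDS = 5

def replication_lag_seconds() -> float:
    """Wall-clock age of the last transaction replayed on the replica."""
    with psycopg2.connect(REPLICA_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
            )
            lag = cur.fetchone()[0]
    return float(lag or 0.0)

if replication_lag_seconds() > RPO_TARGET_SECONDS:
    print("WARNING: worst-case data loss currently exceeds the RPO target")
```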
Testing and Validating Failover
Failover mechanisms are not trustworthy until they’re exercised under realistic conditions.
Types of Tests
- Planned failover drills
- Intentionally migrate services between nodes or sites.
- Failure injection
- Kill processes, unplug network cables, hard power off nodes.
- Disaster scenarios
- Simulate loss of an entire rack, switch, or site.
What to Measure
- Actual RTO during each scenario.
- Any data loss (RPO).
- Application behavior:
- Are errors properly surfaced or retried?
- Do clients recover automatically or need manual restart?
- Cluster decision-making:
- Was fencing executed properly?
- Any split-brain episodes or near misses?
- Were all dependencies moved correctly?
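A minimal sketch of one such drill: hard-kill the active instance and time how long the health endpoint stays down (the URL and unit name are placeholders; run it on the active node or wrap the kill in SSH):

```python
import subprocess
import time
import urllib.request

SERVICE_URL = "http://service.example.internal/healthz"   # placeholder endpoint
ACTIVE_UNIT = "myapp.service"                              # placeholder systemd unit

def service_is_up() -> bool:
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def measure_rto() -> float:
    # Inject the failure: SIGKILL the active instance, no graceful shutdown.
    subprocess.run(["systemctl", "kill", "-s", "SIGKILL", ACTIVE_UNIT], check=True)
    start = time.monotonic()
    while not service_is_up():
        time.sleep(0.5)
    return time.monotonic() - start

print(f"measured RTO: {measure_rto():.1f} s")
```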
Operational Readiness
- Clear runbooks documenting:
- How to trigger planned failover.
- How to verify success.
- How to roll back if needed.
- Logs and monitoring:
- Alerts on failover events.
- Metrics for failover frequency and duration.
Human and Operational Principles
The technology only works if operational practices support it.
Key principles:
- Simplicity over cleverness
- Minimal necessary nodes and roles.
- Avoid over-optimization that complicates failure modes.
- Consistency
- Use configuration management; avoid snowflake nodes.
- Separation of concerns
- Clear boundaries: what clustering controls vs what external tooling controls.
- Documented policies
- When to trigger manual failover.
- When not to fail over (e.g., an upstream outage where failover won't help).
- Post-mortems
- After every significant failover, analyze:
- Root cause.
- Whether failover worked as expected.
- Improvements to policies, thresholds, or architecture.
By applying these principles consistently, you move from ad-hoc “restart and hope” towards predictable, controlled failover that underpins real high availability.