Failover principles

Core Concepts of Failover

Failover is the controlled handover of a workload from one component (node, service, link, data center) to another when the first becomes unavailable or unhealthy.

You design failover around three key questions:

  1. What fails?
    Node, service, storage, network path, data center.
  2. Who decides it failed?
    Cluster stack, monitoring agent, load balancer, external controller.
  3. Where does it go?
    Standby node, any available node, DR site, degraded mode on same node.

High availability (HA) clustering uses automated failover to turn single-node failures into brief service interruptions, not outages.

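Two critical metrics shape your design:

  1. Recovery Time Objective (RTO)
    The maximum acceptable time to restore the service after a failure.
  2. Recovery Point Objective (RPO)
    The maximum acceptable data loss, expressed as a window of time before the failure.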

Your failover strategy is essentially about meeting RTO/RPO within cost and complexity limits.

Active/Passive vs Active/Active Failover

Active/Passive

Pros:

Cons:

Active/Active

Pros:

Cons:

In practice, you might mix them: e.g., active/active stateless web tier, active/passive stateful database.

Failover Triggers and Detection

Failover starts with detection. Triggering too late increases downtime; triggering too early causes instability.

Typical Triggers

Failure Detection Methods

You almost never want to fail over on a single missed heartbeat. Use thresholds (e.g., 3 missed heartbeats) and multiple sources of truth to avoid flapping.
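As a minimal sketch of the threshold idea, the monitor below triggers only after a run of consecutive missed heartbeats. The class name `HeartbeatMonitor` and its fields are hypothetical, and the threshold of 3 is simply the example value from the text, not a recommendation for any particular cluster stack.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Declares a peer failed only after `threshold` consecutive misses."""
    threshold: int = 3  # assumed value, taken from the "3 missed heartbeats" example
    missed: int = 0

    def record(self, heartbeat_received: bool) -> bool:
        """Record one heartbeat interval; return True if failover should trigger."""
        if heartbeat_received:
            self.missed = 0  # any successful heartbeat resets the counter
        else:
            self.missed += 1
        return self.missed >= self.threshold


monitor = HeartbeatMonitor(threshold=3)
# One isolated miss, a recovery, then three misses in a row.
observations = [True, False, True, False, False, False]
decisions = [monitor.record(ok) for ok in observations]
print(decisions)
```

The isolated miss does not trigger anything because the next heartbeat resets the counter; only the final run of three misses does. A real detector would typically combine this with a second, independent signal before acting.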

Coordinating Resources During Failover

Failover is not just starting a service on another node. You must move or reassign associated resources consistently:

Cluster stacks model this as resource groups with ordering and colocation constraints so the cluster can move the entire service atomically.
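The ordering idea can be sketched as a plan that stops a group in reverse dependency order on the source node and starts it in dependency order on the target. The resource names and the `failover_plan` helper are hypothetical illustrations; a real cluster stack expresses this in its own configuration language.

```python
# Hypothetical resource names, listed in the order they must start.
RESOURCE_GROUP = ["storage_mount", "virtual_ip", "database_service"]


def start_order(group):
    """Resources start in dependency order: storage, then address, then service."""
    return list(group)


def stop_order(group):
    """Resources stop in reverse order so nothing loses a dependency while running."""
    return list(reversed(group))


def failover_plan(group, source, target):
    """Return the ordered steps that move the whole group from source to target."""
    steps = [("stop", name, source) for name in stop_order(group)]
    # Colocation: every resource in the group starts on the same target node.
    steps += [("start", name, target) for name in start_order(group)]
    return steps


plan = failover_plan(RESOURCE_GROUP, "node-a", "node-b")
for step in plan:
    print(step)
```

Keeping the group as one ordered unit means the service is never started before its storage or address is in place, and never left split across two nodes.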

Avoiding Split-Brain

Split-brain occurs when two or more nodes believe they are primary for the same resource (especially storage or a stateful service). The consequences can be catastrophic: divergent data, corruption, and clients that see inconsistent state.

Why It Happens

Preventive Principles

1. Quorum

Use a quorum mechanism so only the majority partition can make changes:
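A minimal sketch of the majority rule, assuming one vote per node and no tie-breaker device. The function name `has_quorum` is hypothetical.

```python
def has_quorum(visible_votes: int, total_votes: int) -> bool:
    """A partition has quorum only with a strict majority of all configured votes."""
    return visible_votes > total_votes // 2


# A 5-node cluster partitioned into groups of 3 and 2.
majority_side = has_quorum(3, 5)
minority_side = has_quorum(2, 5)

# A 4-node cluster split evenly: neither half holds a strict majority.
even_split = has_quorum(2, 4)

print(majority_side, minority_side, even_split)
```

The even split is why clusters with an even number of nodes usually need an extra vote from some tie-breaking source; without one, both halves must stop making changes.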

2. Fencing (STONITH)

Fencing ensures that a node that might be alive but unreachable is forcibly removed:

The principle: “Better to kill a maybe-alive node than risk two primaries.”
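That principle can be sketched as a promotion routine that refuses to proceed unless fencing is confirmed. The `fence` and `promote` callables and the `FencingError` type are hypothetical stand-ins; in a real deployment the fencing action would be a power or storage-isolation operation performed by the cluster stack.

```python
class FencingError(Exception):
    """Raised when the old primary could not be confirmed powered off or isolated."""


def promote_standby(old_primary, fence, promote):
    """Promote the standby only after fencing of the old primary is confirmed.

    `fence` and `promote` are hypothetical callables supplied by the caller;
    `fence` must return True only when the node is confirmed off or isolated.
    """
    if not fence(old_primary):
        # Without confirmed fencing, two primaries are possible: refuse to proceed.
        raise FencingError(f"could not fence {old_primary}; refusing to promote")
    return promote()


events = []


def fake_fence_ok(node):
    events.append(f"fenced {node}")
    return True


def fake_promote():
    events.append("promoted standby")
    return "standby-is-primary"


result = promote_standby("node-a", fake_fence_ok, fake_promote)
print(events)
```

The important design choice is the failure path: if fencing cannot be confirmed, the routine raises instead of promoting, accepting longer downtime over the risk of two writers.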

3. Single-Writer Rules

For storage and some stateful services:

4. Independent Heartbeat Paths

Planned vs Unplanned Failover

Unplanned Failover

Triggered by unexpected failure:

Characteristics:

Planned Failover (Switchover)

You intentionally move workloads:

Characteristics:

Design your clusters so planned failover follows the same path as unplanned failover, just with more grace and logging.

Stateful vs Stateless Failover

Failover principles differ sharply between stateless and stateful components.

Stateless Services

Examples: HTTP frontends, API gateways, some microservices.

Key consequences:

Stateful Services

Examples: databases, message queues, stateful monoliths.

Challenges:

Key principles:

Network Identity and Client Transparency

To make failover invisible, or at least minimally disruptive, to clients, have them connect to a stable service endpoint rather than to a specific node, and control which node sits behind that endpoint.

Common strategies:

Principle: abstract service identity from physical node identity.
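As a toy illustration of this principle, the directory below maps a stable service name to whichever node currently hosts it. The `ServiceDirectory` class and the names used are hypothetical; in practice this indirection is provided by infrastructure such as a floating IP address, a DNS record, or a load balancer rather than by application code.

```python
class ServiceDirectory:
    """Maps a stable service name to whichever node currently hosts it."""

    def __init__(self):
        self._active = {}

    def set_active(self, service, node):
        """Point the service name at a node; called by whatever performs failover."""
        self._active[service] = node

    def resolve(self, service):
        """Return the node currently behind the service name."""
        return self._active[service]


directory = ServiceDirectory()
directory.set_active("db.example.internal", "node-a")
before = directory.resolve("db.example.internal")

# Failover: the name stays the same, only the node behind it changes.
directory.set_active("db.example.internal", "node-b")
after = directory.resolve("db.example.internal")

print(before, after)
```

Clients only ever hold the service name, so a failover requires no client reconfiguration, only a fresh lookup or reconnect.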

Minimizing RTO and RPO

You use architecture and configuration to tune RTO and RPO.

Techniques to Reduce RTO

Techniques to Improve (Lower) RPO

Always validate what RPO you achieve in reality, not just in theory, by testing failure scenarios.
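One way to make that validation concrete is to compute the achieved data-loss window from a failover test and compare it with the target. This is a sketch under stated assumptions: the helper names, the timestamps, and the 30-second target are all hypothetical, and it assumes you can identify the last write that reached the standby.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_write: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: time between the last replicated write and the failure."""
    return max(failure_time - last_replicated_write, timedelta(0))


def meets_target(achieved: timedelta, target: timedelta) -> bool:
    """True when the measured data-loss window is within the RPO target."""
    return achieved <= target


# Hypothetical timestamps recorded during a failover test.
last_write = datetime(2024, 1, 1, 12, 0, 0)
failure = datetime(2024, 1, 1, 12, 0, 45)

rpo = achieved_rpo(last_write, failure)
within_target = meets_target(rpo, timedelta(seconds=30))
print(rpo.total_seconds(), within_target)
```

In this made-up run the standby was 45 seconds behind at the moment of failure, so a 30-second RPO target would not have been met even if the replication settings suggested otherwise.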

Testing and Validating Failover

Failover mechanisms are not trustworthy until they’re exercised under realistic conditions.
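A simple way to quantify a failover exercise is to probe the service at a fixed interval during the test and compute the longest outage from the results. The sketch below assumes probe results are recorded as hypothetical `(timestamp_seconds, ok)` pairs; the function name `measured_downtime` is also an illustration.

```python
def measured_downtime(probes):
    """Longest continuous outage, in seconds, from (timestamp_seconds, ok) probe results.

    An outage runs from the first failed probe to the next successful probe.
    An outage still in progress at the end of the data is not counted.
    """
    longest = 0
    outage_start = None
    for timestamp, ok in probes:
        if not ok and outage_start is None:
            outage_start = timestamp  # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # outage ends
    return longest


# Hypothetical one-second probes taken while a node was killed.
probes = [(0, True), (1, True), (2, False), (3, False), (4, False), (5, True), (6, True)]
downtime = measured_downtime(probes)
print(downtime)
```

The result is only as precise as the probe interval, so probe more frequently than the RTO you are trying to verify.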

Types of Tests

What to Measure

Operational Readiness

Human and Operational Principles

The technology only works if operational practices support it.

Key principles:

By applying these principles consistently, you move from ad-hoc “restart and hope” towards predictable, controlled failover that underpins real high availability.
