Goals of High Availability
High availability (HA) aims to keep services accessible despite failures and planned maintenance. For load balancers, HA means:
- The service’s virtual IP / hostname remains reachable.
- At least one functioning load balancer is always available.
- Failures are detected and handled automatically (no manual intervention).
- Maintenance can be done with minimal or no downtime.
Typical goals are expressed as an “availability percentage”:
- $99\%$ (“two nines”): ~3.65 days downtime/year
- $99.9\%$ (“three nines”): ~8.8 hours/year
- $99.99\%$ (“four nines”): ~52.6 minutes/year
The higher the target, the more redundancy, complexity, and cost you accept.
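The downtime budgets behind the "nines" above follow from simple arithmetic; a minimal sketch:

```python
# Illustrative: convert an availability target into a yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/year")
    # 99.0%  -> 5256.0 min/year (~3.65 days)
    # 99.9%  -> 525.6 min/year  (~8.76 hours)
    # 99.99% -> 52.6 min/year
```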
Redundancy and Single Points of Failure
An HA design removes or mitigates single points of failure (SPOFs). In the load balancing context, look at:
- Network: single switch, router, or upstream link?
- Load balancer nodes: only one machine doing LB?
- Backend pool: only one app server or database?
- Storage: single disk, single NAS, single DB instance?
- Power: one PSU, one rack PDU, one UPS?
Common strategies:
- N+1 redundancy: N units required to handle normal load; one additional spare (e.g., 2 active web servers + 1 standby).
- N+M redundancy: N required + M spares (e.g., 3 required, 2 spare).
- N-way active: all N units are active and share traffic; capacity margin is provided by overprovisioning.
For load balancers, running two or more nodes is common: if any one fails, the VIP is moved or traffic is rerouted.
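A quick way to sanity-check an N+M or N-way design is to ask whether the capacity surviving a failure still covers demand. A minimal sketch, with all figures illustrative:

```python
def survives_failures(nodes: int, capacity_per_node: float,
                      demand: float, failures: int) -> bool:
    """True if the nodes remaining after `failures` can still carry `demand`."""
    remaining = nodes - failures
    return remaining > 0 and remaining * capacity_per_node >= demand

# N+1 sizing: 3 nodes, any 2 of which must carry the full load.
print(survives_failures(3, 60.0, 100.0, failures=1))  # True:  2 * 60 >= 100
print(survives_failures(3, 60.0, 100.0, failures=2))  # False: 1 * 60 <  100
```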
Failure Models and Failure Domains
You design HA based on which failures you expect and want to survive:
- Node failure: server crash, kernel panic, hardware fault.
- Service failure: LB process (e.g., HAProxy, Nginx) dies or misbehaves.
- Network partition: nodes can’t talk to each other (split network).
- Storage failure: disk, array, or storage network fails.
- Site or zone failure: datacenter, availability zone, or rack outage.
- Configuration / operational errors: bad deployment, misconfig, wrong firewall rule.
A failure domain is the scope of impact from a given failure:
- One physical host
- One rack
- One VLAN
- One availability zone
- One region
HA design tries to contain failures within small domains and, where possible, ensures that redundant components (e.g., active and standby load balancers) do not share the same failure domain.
Active/Passive vs Active/Active Architectures
Active/Passive
- One active load balancer handles all traffic.
- One or more passive (standby) load balancers are ready but idle.
- When the active fails, a passive node takes over the virtual IP / hostname (failover).
Pros:
- Simple state model.
- Easy to reason about failover.
- Good where session state or complex config is involved.
Cons:
- Standby capacity is underused.
- Failover events are disruptive (albeit short).
- Risk of split-brain if failover detection is incorrect.
Used often with:
- Floating/virtual IPs via VRRP/Keepalived or cloud-native failover.
- Shared configuration (or configuration replication) between LB nodes.
Active/Active
- Multiple load balancers are simultaneously active.
- Traffic is distributed (via DNS, anycast, L3/L4 load balancing, or equal-cost routing).
Pros:
- Better capacity utilization.
- No single “master” whose failure is special.
- Scaling is easier (add more active nodes).
Cons:
- More complex routing and failure detection.
- May need state sharing (e.g., session stickiness, TLS session tickets).
- Debugging can be harder.
Often combined with:
- ECMP (Equal-Cost Multi-Path) routing or L4 load balancers above the L7 load balancers.
- DNS-based distribution (multiple A/AAAA records) with aggressive health checking.
Failover and Fencing
Failover Mechanisms
Failover is the process of moving service responsibility from a failed node to a healthy node. For load balancers, tasks include:
- Detecting that the active node is down or unhealthy.
- Electing a new active node (if active/passive).
- Moving the virtual IP address or updating routing.
- Reconfiguring ARP tables / neighbor caches (e.g., via gratuitous ARP).
- Optionally, warming caches or restoring in-memory state.
Detection mechanisms:
- Local checks: LB process is running and listening.
- Peer checks: nodes monitor each other via heartbeat packets or health APIs.
- External observer: third-party service or cluster manager decides.
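Peer heartbeat checking can be sketched as follows. The timeout and peer names are illustrative; production implementations (Keepalived, Corosync) handle far more edge cases:

```python
import time

class PeerMonitor:
    """Declare a peer dead once `timeout` seconds pass without a heartbeat."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # peer name -> timestamp of last heartbeat

    def heartbeat(self, peer: str, now=None) -> None:
        """Record a heartbeat from `peer` (injectable clock for testing)."""
        self.last_seen[peer] = time.monotonic() if now is None else now

    def is_alive(self, peer: str, now=None) -> bool:
        """True if `peer` has sent a heartbeat within the timeout window."""
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(peer)
        return seen is not None and (now - seen) <= self.timeout

# A standby would poll is_alive("lb-active") and claim the VIP when it fails.
```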
Fencing (STONITH)
Fencing ensures a failed or partitioned node cannot still be serving traffic or writing data after being declared dead by the cluster. This avoids split-brain, where two nodes both believe they are active.
Common fencing actions:
- Power off the offending node (via IPMI/iDRAC/ILO).
- Disable its switch port.
- Reboot it via a management controller.
- Revoke or remove its ability to advertise the VIP.
STONITH (“Shoot The Other Node In The Head”) is a classic term for forcibly removing a node to protect data and consistency. Even for load balancing, fencing is useful to avoid two nodes answering as the same VIP at once.
Quorum and Split-Brain
In multi-node clusters, quorum is how the cluster decides who is allowed to operate when communication is partially lost.
- Quorum: a majority of nodes (or weighted votes) must agree that the cluster is healthy before any of them serves authoritatively.
- Split-brain: the cluster splits into groups that each think the others are down; each group might attempt to be active, breaking consistency.
Common strategies:
- Odd number of voters (3, 5, 7…) to avoid ties.
- Quorum device / tie-breaker (e.g., a shared disk, a cloud witness node).
- No quorum, no service: when quorum is lost, the cluster deliberately stops serving to avoid split-brain.
In HA load balancer clusters, you might:
- Use a dedicated cluster manager (e.g., Pacemaker/Corosync) that enforces quorum for the VIP resource.
- Ensure only the quorum side continues advertising the VIP.
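The majority rule itself is simple; a sketch with illustrative vote counts:

```python
def has_quorum(votes_visible: int, total_votes: int) -> bool:
    """Strict majority of configured votes must be reachable to keep serving."""
    return votes_visible > total_votes // 2

# A 3-node cluster partitioned 2 vs. 1: only the larger side keeps the VIP.
print(has_quorum(2, 3))  # True  -> this side may advertise the VIP
print(has_quorum(1, 3))  # False -> "no quorum, no service"
```

Note that an even-sized cluster splitting down the middle leaves neither side with a majority, which is why odd voter counts or a tie-breaker are recommended.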
Health Checking and Failure Detection
The quality of your health checks directly influences HA behavior:
- Liveness checks: “Is the node or process alive?”
- Ping, TCP port open, process exists.
- Readiness checks: “Can it actually serve correct responses?”
- HTTP 200 status, a test query or endpoint, business-level checks.
- Dependency checks: “Can it reach its dependencies?”
- Database, cache, other services, storage.
Considerations:
- Sensitivity: too aggressive → frequent false positives and needless failovers; too lax → long outages before failover.
- Health check interval and failure threshold (e.g., 3 failures in 10 seconds).
- Multi-level checks: LB health, OS health, network path health.
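The "N failures within a window" threshold mentioned above might be tracked like this (threshold and window values are illustrative):

```python
from collections import deque

class FailureWindow:
    """Flag a target unhealthy after `threshold` failures within `window` seconds."""

    def __init__(self, threshold: int = 3, window: float = 10.0):
        self.threshold = threshold
        self.window = window
        self.failures = deque()  # timestamps of recent failed checks

    def record_failure(self, now: float) -> None:
        self.failures.append(now)

    def unhealthy(self, now: float) -> bool:
        # Drop failures that have aged out of the window, then compare.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) >= self.threshold
```

Tuning `threshold` and `window` is exactly the sensitivity trade-off described above: small values fail over fast but risk flapping; large values delay recovery.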
For load balancers:
- Upstream LBs or routers perform health checks on LB nodes.
- LBs perform health checks on backend servers.
- Cluster software performs health checks on each LB node’s roles (VIP owner, etc.).
Session State and HA
Load balancers often deal with stateful clients (e.g., web sessions). HA must consider what happens on failover or when an LB node disappears:
- Stateless design: ideal; all state in backend DB/cache; LBs can be replaced without impact.
- Load balancer–local state: connection tables, TLS sessions, local caches, rate limits, etc.
- Application-layer state: sticky sessions, in-memory user sessions, in-LB caches.
High availability options:
- Session replication between LBs (for sticky sessions, caches, etc.).
- External shared state (Redis, memcached, database).
- Consistent hashing and sticky mechanisms that spread the impact of node loss.
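Consistent hashing, mentioned above, places nodes on a hash ring (with virtual nodes for balance) so that losing one node remaps only that node's share of keys. A minimal sketch, with node names illustrative:

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Hash ring with virtual nodes; removing a node only remaps its own keys."""

    def __init__(self, nodes, vnodes: int = 100):
        self.ring = sorted(
            (self._h(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self.ring]

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First ring point clockwise of the key's hash (wrapping around).
        idx = bisect_right(self._points, self._h(key)) % len(self.ring)
        return self.ring[idx][1]
```

Removing a node from the ring leaves every key that was not mapped to it on the same node as before, which is the "spread the impact of node loss" property the list above refers to.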
You must define acceptable client impact on failover:
- Is it okay if users lose sessions on LB failover?
- Do you require transparent failover where existing connections are preserved? (More complex; may depend on TCP/IP behavior and LB design.)
High Availability vs Fault Tolerance vs Disaster Recovery
These terms are related but distinct:
- High Availability
- Aim: minimize service downtime.
- Failover is allowed; brief interruptions may be acceptable.
- Typically multi-node, same site or same region.
- Fault Tolerance
- Aim: seamless operation with no visible interruption even if components fail.
- Often requires real-time redundancy and expensive solutions (e.g., lockstep hardware, instant state mirroring).
- Disaster Recovery (DR)
- Aim: recover service after a catastrophic event (site outage, data loss).
- May allow longer downtime (minutes to hours) but protects data and business continuity.
- Often uses secondary sites/regions with async replication and DR playbooks.
In a load balancing context:
- HA often covers failure of a single LB node or rack.
- DR covers loss of the entire data center (you fail over to another region with its own LBs).
Availability Metrics: RTO and RPO
When designing HA, especially with multiple sites, you must define:
- RTO (Recovery Time Objective)
- How quickly the service must be restored after an outage.
- E.g., RTO 60 seconds: automated failover must complete within 1 minute.
- RPO (Recovery Point Objective)
- How much data loss (in time) is acceptable.
- E.g., RPO 5 minutes: at most 5 minutes of recent data can be lost.
For pure load balancing layers (stateless), RPO often doesn’t apply directly, but:
- State tied to the LB (e.g., WAF rules, config, SSL certs) must be replicated so RPO ≈ 0.
- For backends (DB, storage), RPO drives the replication strategy that HA LBs will route to.
Maintenance, Upgrades, and Testing
High availability is not just about unexpected failures; it also enables planned work without downtime.
Key practices:
- Rolling maintenance:
- Drain connections from LB node A; move VIP / routing to node B.
- Upgrade A, validate, then fail back or continue upgrading others.
- Draining and graceful shutdown:
- Stop sending new connections.
- Let existing connections complete (or time out) before fully stopping.
- Configuration management and versioning:
- Keep LB config in Git or similar.
- Ensure all nodes get consistent, tested config.
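The draining step can be sketched in miniature as follows; real load balancers implement this natively (e.g., HAProxy's server drain state), so this is only to illustrate the mechanism:

```python
import threading
import time

class ConnectionDrainer:
    """Refuse new connections, then wait for in-flight ones to complete."""

    def __init__(self):
        self.draining = False
        self.active = 0
        self.lock = threading.Lock()

    def accept(self) -> bool:
        """Called when a new connection arrives; False means reject it."""
        with self.lock:
            if self.draining:
                return False
            self.active += 1
            return True

    def finish(self) -> None:
        """Called when a connection completes."""
        with self.lock:
            self.active -= 1

    def drain(self, grace_seconds: float) -> bool:
        """Stop accepting; True if all connections finish within the grace period."""
        with self.lock:
            self.draining = True
        deadline = time.monotonic() + grace_seconds
        while time.monotonic() < deadline:
            with self.lock:
                if self.active == 0:
                    return True
            time.sleep(0.01)
        with self.lock:
            return self.active == 0
```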
Testing is critical:
- Regular failover drills: simulate LB node failure and measure real RTO.
- Chaos testing: randomly kill or isolate components in non-production to validate resilience.
- Monitoring + alerting: ensure you see pre-failure symptoms (CPU and memory pressure, link saturation) before they cause outages.
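A failover drill essentially times the gap between killing the active node and the service answering again. A sketch in which `kill_active` and `service_healthy` are hypothetical hooks you would supply (e.g., a power-off command and an HTTP probe against the VIP):

```python
import time

def measure_rto(kill_active, service_healthy,
                poll_interval: float = 0.5, limit: float = 120.0) -> float:
    """Trigger a failure, then return seconds until the service responds again."""
    kill_active()                      # hypothetical hook: fail the active node
    start = time.monotonic()
    while time.monotonic() - start < limit:
        if service_healthy():          # hypothetical hook: probe the VIP
            return time.monotonic() - start
        time.sleep(poll_interval)
    raise TimeoutError(f"failover did not complete within {limit}s")
```

Running this in a scheduled drill window and comparing the result against your RTO target turns the availability goal into a measurable, repeatable test.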
Design Trade-Offs and Practical Guidelines
When applying HA concepts to load balancers:
- Start by identifying SPOFs:
- Single LB node? Single VIP? Single upstream network?
- Decide acceptable availability target (e.g., 99.9% vs 99.99%).
- Choose an architecture:
- Simple: two-node active/passive with VRRP/Keepalived.
- More advanced: multi-node active/active behind upstream routing or DNS.
- Implement:
- Robust health checks and conservative failover thresholds.
- Fencing for correctness in cluster scenarios.
- Quorum rules to avoid split-brain.
- Plan for:
- Session and state handling (stateless where possible).
- Routine failover tests and documented runbooks.
- Observability: logs, metrics, traces on both LB and backends.
These concepts form the foundation for building reliable, production-grade load balancing setups, whether you use HAProxy, Nginx, cloud-native load balancers, or a combination.