Goals of Clustering and High Availability
Clustering and high availability (HA) aim to keep services running despite failures. This chapter focuses on:
- Core HA concepts and terminology on Linux
- Typical cluster architectures and components
- How Linux HA building blocks fit together
- Design considerations before you deploy Pacemaker/Corosync, distributed filesystems, load balancers, etc.
Implementation details for specific tools (Pacemaker, Corosync, distributed filesystems, HAProxy, etc.) are covered in their own chapters.
Key Concepts and Terminology
Availability vs Reliability vs Scalability
- Availability: probability a service is usable at a given time. Often expressed as “nines”:
- $99.0\%$ ≈ 3.65 days downtime/year
- $99.9\%$ ≈ 8.76 hours/year
- $99.99\%$ ≈ 52.6 minutes/year
- $99.999\%$ ≈ 5.26 minutes/year
- Reliability: how often a component fails (e.g., disk MTBF).
- Scalability: how easily you can add capacity (more nodes, more CPU, etc.).
HA clustering focuses on availability, sometimes at the cost of complexity.
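The downtime figures above follow directly from the availability percentage. A minimal Python sketch, using a 365-day year (the same approximation as the list above), reproduces them:

```python
# Convert an availability percentage into allowed downtime per year.

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the allowed downtime in minutes per year."""
    minutes_per_year = 365 * 24 * 60          # 525,600 minutes
    return (1 - availability_pct / 100) * minutes_per_year

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")
# 99.0%   -> 5256.0 min/year  (~3.65 days)
# 99.9%   -> 525.6 min/year   (~8.76 hours)
# 99.99%  -> 52.6 min/year
# 99.999% -> 5.3 min/year
```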
RTO and RPO
- RTO (Recovery Time Objective): how long your service may be down during an incident.
- RPO (Recovery Point Objective): how much data loss (time window) is acceptable.
Clustering strategies are selected to meet explicit RTO/RPO targets.
Redundancy Models
- Active–Passive: one node serves traffic, another is on standby.
- Simpler logic
- Often easier to guarantee data consistency
- Active–Active: multiple nodes serve traffic simultaneously.
- Better resource usage and scalability
- Requires careful state and data synchronization
Failover and Failback
- Failover: automatic move of a service from a failed node to a healthy one.
- Failback: moving the service back to the original node once it’s healthy again.
Policies govern when and whether failback happens (immediate, manual, delayed).
Single Points of Failure (SPOFs)
An HA design tries to avoid any single component whose failure breaks the service:
- Hardware: single NIC, single PSU, single disk.
- Network: single switch, single router.
- Software: single database node, single API node.
- Infrastructure: single power feed, single storage array, single DNS provider.
Eliminating SPOFs usually increases cost and complexity; you must prioritize based on business needs.
Types of Clusters
High Availability (Failover) Clusters
- Purpose: keep services running when nodes or processes fail.
- Usually small (2–5 nodes).
- Managed by a cluster stack (e.g., Corosync + Pacemaker).
- Shared state via:
- Shared storage (SAN, NAS, distributed FS), or
- Replication (database replication, block replication, etc.).
Typical use: databases, NFS servers, important stateful services.
Load Balancing Clusters
- Purpose: distribute requests across multiple servers to:
- Handle higher load
- Increase availability
- Often stateless service nodes behind:
- Layer 4 load balancer (TCP)
- Layer 7 load balancer (HTTP-aware)
- Session persistence (“sticky sessions”) may be required for stateful web apps.
Covered in more detail in the Load Balancing chapter; here you should recognize that load balancing is one HA technique, often combined with backend failover.
Storage Clusters
- Purpose: provide redundant, consistent storage:
- Shared block storage (e.g., iSCSI, Fibre Channel)
- Distributed filesystems (e.g., CephFS, GlusterFS)
- Provide:
- Replication or erasure coding
- Self-healing
- Consistent views of data across nodes
Storage clusters are often a foundation for HA application clusters.
Geo-Clusters (Site-to-Site Failover)
- Spread across multiple locations (data centers / availability zones).
- Handle site-level failures:
- Power outage
- Network partition
- Natural disaster
- Require:
- Cross-site replication for data
- DNS- or IP-based traffic steering
- Strict split-brain protections and clear runbooks
Compared to local failover, RTO and RPO are typically larger and harder to meet, and the failover process itself is more complex to manage.
Core Building Blocks in Linux HA
Cluster Membership and Quorum
- Membership: which nodes are considered part of the cluster and “online”.
- Quorum: majority of nodes must agree before the cluster can make changes (like starting or stopping services).
- In a simple N-node cluster:
- Quorum threshold is typically $\lfloor \frac{N}{2} \rfloor + 1$.
- Prevents split-brain: two partitions each believing they’re the real cluster.
Mechanisms include:
- Voting nodes
- Quorum devices (tie-breakers)
- Witness servers or disks
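To make the majority rule concrete, here is a small illustrative Python sketch. It is not how Corosync tallies votes internally, just the arithmetic behind $\lfloor \frac{N}{2} \rfloor + 1$:

```python
# Majority quorum: a partition may run resources only if it holds
# more than half of the total votes.

def quorum_threshold(total_votes: int) -> int:
    return total_votes // 2 + 1

def has_quorum(partition_votes: int, total_votes: int) -> bool:
    return partition_votes >= quorum_threshold(total_votes)

# A 5-node cluster split 3/2: only the 3-node side keeps quorum.
print(quorum_threshold(5))   # 3
print(has_quorum(3, 5))      # True
print(has_quorum(2, 5))      # False

# A 2-node cluster split 1/1: neither side reaches floor(2/2)+1 = 2,
# which is why 2-node clusters need a tie-breaker (witness/quorum device).
print(has_quorum(1, 2))      # False
```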
Fencing (STONITH)
When a node appears dead but might still be running, you must fence it before starting its services elsewhere.
- STONITH: “Shoot The Other Node In The Head”
- Power off or reset via:
- IPMI/iDRAC/iLO
- Smart PDUs
- Hypervisor APIs
- Prevents:
- Two nodes writing to the same data set (corruption)
- Conflicting network identities (duplicate IPs)
Reliable fencing is not optional in any serious HA design.
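The essential ordering is “fence first, recover second”. A conceptual Python sketch of that rule; `fence_node()` is a hypothetical placeholder for a real fence agent (IPMI, smart PDU, hypervisor API):

```python
# Conceptual "fence before failover" ordering. Recovery must not start
# until fencing has *confirmed* success.

class FencingFailed(Exception):
    pass

def fence_node(node: str) -> bool:
    """Hypothetical placeholder: power-cycle the node and verify it is off."""
    # A real fence agent would contact the BMC/PDU and confirm the power state.
    return False  # simulate an unreachable BMC so the fail-safe path runs

def recover_on(standby: str, failed: str) -> None:
    if not fence_node(failed):
        # Fail safe: staying down beats two nodes writing to the same data.
        raise FencingFailed(f"could not fence {failed}; refusing to fail over")
    print(f"{failed} fenced; starting storage, service and VIP on {standby}")
```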
Virtual IPs and Service Identity
Services often expose a single IP or hostname, regardless of which node currently runs them:
- Virtual IP (VIP):
- Moves between nodes with the service
- Announced via ARP or routing updates
- DNS:
- May be used instead of (or in addition to) VIPs, but DNS caching makes rapid failover tricky.
The cluster manager controls where the VIP lives and associates it with resource health.
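To illustrate what “moving a VIP” means at the OS level, here is a hedged sketch that adds the address and announces it with gratuitous ARP. The address (192.0.2.10), prefix, and interface (eth0) are example values; in a real cluster a resource agent such as IPaddr2 performs these steps for you.

```python
# Sketch: bring up a VIP on this node and announce it to the LAN.
# Requires root plus the iproute2 and iputils tools; values are examples.
import subprocess

VIP, PREFIX, IFACE = "192.0.2.10", "24", "eth0"

def take_over_vip() -> None:
    # Attach the virtual IP to this node's interface.
    subprocess.run(["ip", "addr", "add", f"{VIP}/{PREFIX}", "dev", IFACE], check=True)
    # Gratuitous ARP so switches and neighbors update their caches quickly.
    subprocess.run(["arping", "-U", "-I", IFACE, "-c", "3", VIP], check=True)

def release_vip() -> None:
    subprocess.run(["ip", "addr", "del", f"{VIP}/{PREFIX}", "dev", IFACE], check=True)
```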
Resource Agents and Service Abstractions
In an HA cluster, services are abstracted as resources with standardized actions:
- start
- stop
- monitor
- reload
- promote / demote (for master/slave or primary/replica roles)
These are implemented by resource agents (scripts/programs that know how to manage the real service). The cluster engine uses these to automate failover decisions.
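A minimal Python sketch of the idea, purely illustrative (real agents are typically OCF shell scripts or systemd-managed services). The exit codes 0 (success) and 7 (not running) follow the OCF convention; the unit name `myapp` is a placeholder.

```python
# Illustrative resource abstraction: the cluster engine only ever calls
# these standardized actions; the agent hides the service-specific details.
import subprocess

OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING = 0, 1, 7  # OCF return codes

class DummyServiceAgent:
    """Hypothetical agent wrapping a systemd unit called 'myapp'."""

    def start(self) -> int:
        rc = subprocess.run(["systemctl", "start", "myapp"]).returncode
        return OCF_SUCCESS if rc == 0 else OCF_ERR_GENERIC

    def stop(self) -> int:
        rc = subprocess.run(["systemctl", "stop", "myapp"]).returncode
        return OCF_SUCCESS if rc == 0 else OCF_ERR_GENERIC

    def monitor(self) -> int:
        rc = subprocess.run(["systemctl", "is-active", "--quiet", "myapp"]).returncode
        return OCF_SUCCESS if rc == 0 else OCF_NOT_RUNNING
```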
Health Checks and Monitoring
Health is checked at multiple levels:
- Node-level:
- CPU, RAM, basic connectivity, heartbeat messages
- Resource-level:
- Is the process running?
- Is the port open?
- Does a simple query succeed?
- External health:
- Can the service complete a representative request (e.g., HTTP check, DB query)?
Failover decisions use these checks plus policies (how many failures, over what time window, etc.).
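As an illustration, here is a hedged sketch of a resource-level HTTP check combined with a simple “N consecutive failures” policy. The URL, timeout, and threshold are placeholder values.

```python
# Resource-level health check plus a basic failure-count policy.
import urllib.request

FAIL_THRESHOLD = 3  # example policy: 3 consecutive failures trigger failover

def check_http(url: str = "http://127.0.0.1:8080/health", timeout: float = 2.0) -> bool:
    """Can the service complete a representative request?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def should_fail_over(results: list[bool]) -> bool:
    """Apply the policy to the most recent monitor results."""
    recent = results[-FAIL_THRESHOLD:]
    return len(recent) == FAIL_THRESHOLD and not any(recent)
```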
Typical HA Architectures
2-Node Active–Passive Cluster
Basic pattern:
- Two identical nodes, shared storage or replicated storage.
- One node is primary, the other standby.
- Cluster monitors:
- Service health
- Node heartbeat
- On failure:
- Fences failed node
- Mounts storage (or promotes replica)
- Starts service + VIP on standby
Considerations:
- 2-node clusters require special quorum rules or a third quorum device to avoid split-brain.
- Fencing is critical because both nodes can see the shared storage.
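A conceptual sketch of the sequence the cluster manager automates in this pattern; every helper below is a hypothetical placeholder, and the point is the strict ordering:

```python
# Conceptual active-passive failover sequence; actions are placeholders
# for what fence agents and resource agents actually do.

FAILOVER_STEPS = [
    ("fence old primary",        lambda: print("power-cycle node1 via BMC")),
    ("take over shared storage", lambda: print("mount shared volume on node2")),
    ("start the service",        lambda: print("start database on node2")),
    ("move the VIP",             lambda: print("bring up 192.0.2.10 on node2")),
]

def fail_over() -> None:
    # Steps run strictly in order; a failure at any step aborts the rest,
    # which is safer than continuing with an unfenced peer or missing data.
    for name, action in FAILOVER_STEPS:
        print(f"-> {name}")
        action()
```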
N-Node Active–Active Web Tier
- Multiple web servers (stateless as much as possible).
- Fronted by:
- External load balancers, or
- Software load balancers on Linux.
- Failure of a node: remove it from load balancer; remaining nodes handle traffic.
Key design points:
- Sessions:
- Move state to shared storage (cache, DB) to avoid per-node stickiness, or
- Use sticky sessions at the load balancer.
- Scaling:
- Auto-scaling groups or manual addition of nodes.
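The mechanism behind “remove it from the load balancer” can be sketched in a few lines; the backend addresses are placeholders, and a real load balancer performs these checks natively:

```python
# Active-active pool: dispatch only to backends that pass a health check.
import socket

BACKENDS = ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"]  # placeholders

def healthy(backend: str) -> bool:
    """A node that fails its TCP check simply drops out of the pool."""
    host, port = backend.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=1.0):
            return True
    except OSError:
        return False

def current_pool() -> list[str]:
    """The set of backends the load balancer would dispatch to right now."""
    return [b for b in BACKENDS if healthy(b)]
```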
Database HA Patterns (Logical Overview)
Common archetypes:
- Shared storage failover:
- Single DB instance at a time on shared storage.
- Simple conceptual model; storage is a SPOF unless redundant.
- Replication-based HA (primary/replica):
- Primary node accepts writes.
- Replicas receive streamed changes.
- On primary failure, a replica is promoted.
- Multi-primary (active–active):
- More complex conflict resolution and consistency concerns.
- Typically requires application support and advanced DB features.
Clusters must align with the DB’s own replication/failover mechanisms and consistency guarantees.
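One place the RPO target shows up concretely in replication-based HA: before promoting a replica, check how far behind it is. A hedged sketch with placeholder lag values; in practice the lag would come from the database's own replication status.

```python
# Sketch: pick a replica to promote while respecting an RPO budget.

RPO_SECONDS = 60  # business requirement: lose at most 1 minute of writes

replicas = {
    "db2": {"lag_seconds": 4},
    "db3": {"lag_seconds": 130},   # too far behind for this RPO
}

def choose_promotion_candidate() -> str | None:
    eligible = {name: r for name, r in replicas.items()
                if r["lag_seconds"] <= RPO_SECONDS}
    if not eligible:
        return None  # fail safe: escalate to a human instead of violating the RPO
    # Prefer the most up-to-date replica.
    return min(eligible, key=lambda name: eligible[name]["lag_seconds"])

print(choose_promotion_candidate())  # db2
```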
Failure Modes and Design Tradeoffs
Types of Failures
- Node failure: server crash, kernel panic, hardware failure.
- Network failure:
- Node isolated
- Switch/router failures
- Asymmetric links (A sees B, B doesn’t see A).
- Storage failure:
- Disk, array, or networked storage problem.
- Service failure:
- Process crashed or hung
- Configuration errors
- Resource exhaustion (out of file descriptors, memory).
- Human error:
- Misconfiguration
- Accidental resource stop or deletion.
Each must be considered in testing and runbooks.
Split-Brain Scenarios
Split-brain occurs when:
- Cluster partitions into two (or more) groups.
- Each group believes it has quorum.
- Each group may try to run the same service using the same data.
Consequences:
- Data corruption or divergence
- Client confusion (multiple VIPs or conflicting DNS)
- Difficult manual recovery
Prevention/mitigation:
- Correct quorum configuration
- Robust fencing
- Tie-breakers (witness nodes/disks)
- A design that fails safe (stops services) when in doubt.
Consistency vs Availability (CAP Thinking)
While full CAP theorem discussion is beyond this chapter, you should recognize:
- Some systems choose to remain consistent at the cost of being unavailable during partitions.
- Others remain available and accept temporary inconsistency.
- For HA design, decide:
- Can you tolerate serving stale data?
- Is it better to reject requests than risk inconsistency?
Your cluster configuration (failover timing, fencing, write policies) must reflect this choice.
Designing an HA/Clustered Service
Requirements Gathering
Before picking tools:
- Define business requirements:
- Target availability (e.g., $99.9\%$ vs $99.99\%$)
- RTO and RPO
- Regulatory/compliance needs (data locality, encryption, audit).
- Understand workload:
- Read-heavy vs write-heavy
- Stateless vs stateful
- Latency sensitivity
- Budget and complexity tolerance:
- Operational expertise available
- On-call capabilities
- Hardware/licensing budget
Choosing the Right Pattern
Match requirements to patterns:
- Small database for internal app, strong consistency, low budget:
- 2-node active–passive with shared storage or simple primary/replica.
- High-traffic web API, mostly stateless:
- Horizontally scaled app servers behind load balancers.
- Mission-critical storage:
- Replicated/distributed storage cluster, possibly with geo-redundancy.
No single pattern fits all; many environments mix these patterns.
Dependencies and Cascading Failures
In clusters, a “service” often depends on multiple components:
- Database depends on:
- Storage
- Network
- Underlying OS
- Web app depends on:
- Application servers
- Databases/caches
- External services (SMTP, queues, payment gateways)
Mapping dependencies matters because:
- Restarting or failing over higher-level services may not fix lower-level problems.
- HA logic must respect ordering (e.g., start storage before DB, DB before app).
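Computing a safe start order from such a dependency map is essentially a topological sort, as the short illustrative sketch below shows (the service names are made up):

```python
# Derive a safe start order from a dependency map using the standard
# library's topological sort.
from graphlib import TopologicalSorter

# "X depends on Y" means Y must be started before X.
depends_on = {
    "webapp":   {"database", "cache"},
    "database": {"storage"},
    "cache":    set(),
    "storage":  set(),
}

start_order = list(TopologicalSorter(depends_on).static_order())
print(start_order)                  # e.g. ['cache', 'storage', 'database', 'webapp']
print(list(reversed(start_order)))  # a safe stop order
```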
Operational Aspects of Clustering and HA
Configuration Management and Consistency
Clusters are sensitive to drift:
- Different versions or configs on nodes can cause:
- Failed failovers
- Subtle bugs
- Use configuration management tools and/or infrastructure-as-code to:
- Ensure all cluster nodes stay in sync
- Version and review cluster changes
Testing Failover
You must test:
- Node failure:
- Physically power off a node, or simulate the failure from the hypervisor.
- Service failure:
- Kill the process, corrupt config, or block port to see response.
- Network partition:
- Disconnect one NIC or block routes.
- Storage failure:
- Make underlying volume temporarily unavailable.
Measure:
- Actual RTO
- Data loss (if any)
- Behavior of clients (time-outs, reconnection, retries).
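A hedged sketch for capturing the observed RTO during a drill: probe the service continuously and time how long it stays unreachable. The endpoint, port, and polling interval are placeholders.

```python
# Measure observed RTO during a failover drill: time from the first
# failed probe until the service answers again.
import socket
import time

HOST, PORT, INTERVAL = "192.0.2.10", 5432, 1.0  # example values

def is_up() -> bool:
    try:
        with socket.create_connection((HOST, PORT), timeout=1.0):
            return True
    except OSError:
        return False

def measure_rto() -> float:
    while is_up():                 # wait for the induced failure
        time.sleep(INTERVAL)
    outage_start = time.monotonic()
    while not is_up():             # wait for failover to complete
        time.sleep(INTERVAL)
    return time.monotonic() - outage_start

# print(f"observed RTO: {measure_rto():.1f}s")
```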
Observability and Alerting
Effective HA operations require:
- Metrics:
- Node health, resource states, failover counts, replication lag.
- Logs:
- Cluster events, fencing actions, service start/stop events.
- Alerts:
- Node down
- Resource failed to start/stop
- Quorum lost
- Fencing failures
Tie cluster logs and metrics into centralized monitoring and alerting systems.
Runbooks and Procedures
Document step-by-step:
- How to:
- Safely take a node out of service for maintenance
- Add/remove nodes from the cluster
- Trigger or prevent failover during planned work
- What to do during:
- Node failure
- Split-brain detection
- Storage errors
- Escalation procedures and ownership.
Runbooks reduce mistakes during emergencies and standardize responses.
Common Anti-Patterns and Pitfalls
- No fencing:
- Leads to split-brain and data corruption.
- 2-node clusters without proper quorum/witness:
- Frequent split-brain during transient network glitches.
- Treating storage as magically HA:
- Single SAN head or single NFS server becomes a massive SPOF.
- Overly aggressive failover:
- Failing over on transient issues causes “flapping” and instability.
- Complexity exceeding team skill level:
- A sophisticated geo-cluster run by a team with little operational experience often delivers lower real-world availability than a simpler design.
High-Level Lifecycle of a Clustered Service
A typical workflow from concept to production:
- Design:
- Requirements, RTO/RPO, failure modes, patterns.
- Lab/Prototype:
- Single or small test environment mirroring prod architecture.
- Automate:
- Configuration management, deployment scripts.
- Test:
- Functional tests
- Failure scenario simulations
- Production Rollout:
- Gradual ramp-up, pilot users, canary deployments where possible.
- Operate and Improve:
- Track incidents
- Refine configs, thresholds, runbooks
- Periodic failover drills (“game days”)
Understanding this lifecycle helps you integrate Pacemaker/Corosync clusters, distributed filesystems, and load balancers into a coherent HA strategy, rather than isolated technologies.