Goals of Clustering and High Availability
Clustering and high availability (HA) aim to keep services running despite failures. This chapter focuses on:
- Core HA concepts and terminology on Linux
- Typical cluster architectures and components
- How Linux HA building blocks fit together
- Design considerations before you deploy Pacemaker/Corosync, distributed filesystems, load balancers, etc.
Implementation details for specific tools (Pacemaker, Corosync, distributed filesystems, HAProxy, etc.) are covered in their own chapters.
Key Concepts and Terminology
Availability vs Reliability vs Scalability
- Availability: probability a service is usable at a given time. Often expressed as “nines”:
- $99.0\%$ ≈ 3.65 days downtime/year
- $99.9\%$ ≈ 8.76 hours/year
- $99.99\%$ ≈ 52.6 minutes/year
- $99.999\%$ ≈ 5.26 minutes/year
- Reliability: how often a component fails (e.g., disk MTBF).
- Scalability: how easily you can add capacity (more nodes, more CPU, etc.).
HA clustering focuses on availability, sometimes at the cost of complexity.
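The downtime figures above follow directly from the availability percentage. A minimal Python sketch, using a 365-day year (the same approximation as the list above), reproduces them:

```python
# Convert an availability percentage into allowed downtime per year.

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the allowed downtime in minutes per year."""
    minutes_per_year = 365 * 24 * 60          # 525,600 minutes
    return (1 - availability_pct / 100) * minutes_per_year

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")
# 99.0%   -> 5256.0 min/year  (~3.65 days)
# 99.9%   -> 525.6 min/year   (~8.76 hours)
# 99.99%  -> 52.6 min/year
# 99.999% -> 5.3 min/year
```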
RTO and RPO
- RTO (Recovery Time Objective): how long your service may be down during an incident.
- RPO (Recovery Point Objective): how much data loss (time window) is acceptable.
Clustering strategies are selected to meet explicit RTO/RPO targets.
Redundancy Models
- Active–Passive: one node serves traffic, another is on standby.
- Simpler logic
- Often easier to guarantee data consistency
- Active–Active: multiple nodes serve traffic simultaneously.
- Better resource usage and scalability
- Requires careful state and data synchronization
Failover and Failback
- Failover: automatic move of a service from a failed node to a healthy one.
- Failback: moving the service back to the original node once it’s healthy again.
Policies govern when and whether failback happens (immediate, manual, delayed).
Single Points of Failure (SPOFs)
An HA design tries to avoid any single component whose failure breaks the service:
- Hardware: single NIC, single PSU, single disk.
- Network: single switch, single router.
- Software: single database node, single API node.
- Infrastructure: single power feed, single storage array, single DNS provider.
Eliminating SPOFs usually increases cost and complexity; you must prioritize based on business needs.
Types of Clusters
High Availability (Failover) Clusters
- Purpose: keep services running when nodes or processes fail.
- Usually small (2–5 nodes).
- Managed by a cluster stack (e.g., Corosync + Pacemaker).
- Shared state via:
- Shared storage (SAN, NAS, distributed FS), or
- Replication (database replication, block replication, etc.).
Typical use: databases, NFS servers, important stateful services.
Load Balancing Clusters
- Purpose: distribute requests across multiple servers to:
- Handle higher load
- Increase availability
- Often stateless service nodes behind:
- Layer 4 load balancer (TCP)
- Layer 7 load balancer (HTTP-aware)
- Session persistence (“sticky sessions”) may be required for stateful web apps.
Covered in more detail in the Load Balancing chapter; here you should recognize that load balancing is one HA technique, often combined with backend failover.
Storage Clusters
- Purpose: provide redundant, consistent storage:
- Shared block storage (e.g., iSCSI, Fibre Channel)
- Distributed filesystems (e.g., CephFS, GlusterFS)
- Provide:
- Replication or erasure coding
- Self-healing
- Consistent views of data across nodes
Storage clusters are often a foundation for HA application clusters.
Geo-Clusters (Site-to-Site Failover)
- Spread across multiple locations (data centers / availability zones).
- Handle site-level failures:
- Power outage
- Network partition
- Natural disaster
- Require:
- Cross-site replication for data
- DNS- or IP-based traffic steering
- Strict split-brain protections and clear runbooks
Compared to local failover, RTO and RPO are typically larger and harder to meet, and the failover process itself is more complex to manage.
Core Building Blocks in Linux HA
Cluster Membership and Quorum
- Membership: which nodes are considered part of the cluster and “online”.
- Quorum: majority of nodes must agree before the cluster can make changes (like starting or stopping services).
- In a simple N-node cluster:
- Quorum threshold is typically $\lfloor \frac{N}{2} \rfloor + 1$.
- Prevents split-brain: two partitions each believing they’re the real cluster.
Mechanisms include:
- Voting nodes
- Quorum devices (tie-breakers)
- Witness servers or disks
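To make the majority rule concrete, here is a small illustrative Python sketch. It is not how Corosync tallies votes internally, just the arithmetic behind $\lfloor \frac{N}{2} \rfloor + 1$:

```python
# Majority quorum: a partition may run resources only if it holds
# more than half of the total votes.

def quorum_threshold(total_votes: int) -> int:
    return total_votes // 2 + 1

def has_quorum(partition_votes: int, total_votes: int) -> bool:
    return partition_votes >= quorum_threshold(total_votes)

# A 5-node cluster split 3/2: only the 3-node side keeps quorum.
print(quorum_threshold(5))   # 3
print(has_quorum(3, 5))      # True
print(has_quorum(2, 5))      # False

# A 2-node cluster split 1/1: neither side reaches floor(2/2)+1 = 2,
# which is why 2-node clusters need a tie-breaker (witness/quorum device).
print(has_quorum(1, 2))      # False
```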
Fencing (STONITH)
When a node appears dead but might still be running, you must fence it before starting its services elsewhere.
- STONITH: “Shoot The Other Node In The Head”
- Power off or reset via:
- IPMI/iDRAC/iLO
- Smart PDUs
- Hypervisor APIs
- Prevents:
- Two nodes writing to the same data set (corruption)
- Conflicting network identities (duplicate IPs)
Reliable fencing is not optional in any serious HA design.
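The essential ordering is “fence first, recover second”. A conceptual Python sketch of that rule; `fence_node()` is a hypothetical placeholder for a real fence agent (IPMI, smart PDU, hypervisor API):

```python
# Conceptual "fence before failover" ordering. Recovery must not start
# until fencing has *confirmed* success.

class FencingFailed(Exception):
    pass

def fence_node(node: str) -> bool:
    """Hypothetical placeholder: power-cycle the node and verify it is off."""
    # A real fence agent would contact the BMC/PDU and confirm the power state.
    return False  # simulate an unreachable BMC so the fail-safe path runs

def recover_on(standby: str, failed: str) -> None:
    if not fence_node(failed):
        # Fail safe: staying down beats two nodes writing to the same data.
        raise FencingFailed(f"could not fence {failed}; refusing to fail over")
    print(f"{failed} fenced; starting storage, service and VIP on {standby}")
```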
Virtual IPs and Service Identity
Services often expose a single IP or hostname, regardless of which node currently runs them:
- Virtual IP (VIP):
- Moves between nodes with the service
- Announced via ARP or routing updates
- DNS:
- May be used instead of (or in addition to) VIPs, but DNS caching makes rapid failover tricky.
The cluster manager controls where the VIP lives and associates it with resource health.
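To illustrate what “moving a VIP” means at the OS level, here is a hedged sketch that adds the address and announces it with gratuitous ARP. The address (192.0.2.10), prefix, and interface (eth0) are example values; in a real cluster a resource agent such as IPaddr2 performs these steps for you.

```python
# Sketch: bring up a VIP on this node and announce it to the LAN.
# Requires root plus the iproute2 and iputils tools; values are examples.
import subprocess

VIP, PREFIX, IFACE = "192.0.2.10", "24", "eth0"

def take_over_vip() -> None:
    # Attach the virtual IP to this node's interface.
    subprocess.run(["ip", "addr", "add", f"{VIP}/{PREFIX}", "dev", IFACE], check=True)
    # Gratuitous ARP so switches and neighbors update their caches quickly.
    subprocess.run(["arping", "-U", "-I", IFACE, "-c", "3", VIP], check=True)

def release_vip() -> None:
    subprocess.run(["ip", "addr", "del", f"{VIP}/{PREFIX}", "dev", IFACE], check=True)
```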
Resource Agents and Service Abstractions
In an HA cluster, services are abstracted as resources with standardized actions:
- start
- stop
- monitor
- reload
- promote / demote (for master/slave or primary/replica roles)
These are implemented by resource agents (scripts/programs that know how to manage the real service). The cluster engine uses these to automate failover decisions.
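A minimal Python sketch of the idea, purely illustrative (real agents are typically OCF shell scripts or systemd-managed services). The exit codes 0 (success) and 7 (not running) follow the OCF convention; the unit name `myapp` is a placeholder.

```python
# Illustrative resource abstraction: the cluster engine only ever calls
# these standardized actions; the agent hides the service-specific details.
import subprocess

OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING = 0, 1, 7  # OCF return codes

class DummyServiceAgent:
    """Hypothetical agent wrapping a systemd unit called 'myapp'."""

    def start(self) -> int:
        rc = subprocess.run(["systemctl", "start", "myapp"]).returncode
        return OCF_SUCCESS if rc == 0 else OCF_ERR_GENERIC

    def stop(self) -> int:
        rc = subprocess.run(["systemctl", "stop", "myapp"]).returncode
        return OCF_SUCCESS if rc == 0 else OCF_ERR_GENERIC

    def monitor(self) -> int:
        rc = subprocess.run(["systemctl", "is-active", "--quiet", "myapp"]).returncode
        return OCF_SUCCESS if rc == 0 else OCF_NOT_RUNNING
```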
Health Checks and Monitoring
Health is checked at multiple levels:
- Node-level:
- CPU, RAM, basic connectivity, heartbeat messages
- Resource-level:
- Is the process running?
- Is the port open?
- Does a simple query succeed?
- External health:
- Can the service complete a representative request (e.g., HTTP check, DB query)?
Failover decisions use these checks plus policies (how many failures, over what time window, etc.).
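As an illustration, here is a hedged sketch of a resource-level HTTP check combined with a simple “N consecutive failures” policy. The URL, timeout, and threshold are placeholder values.

```python
# Resource-level health check plus a basic failure-count policy.
import urllib.request

FAIL_THRESHOLD = 3  # example policy: 3 consecutive failures trigger failover

def check_http(url: str = "http://127.0.0.1:8080/health", timeout: float = 2.0) -> bool:
    """Can the service complete a representative request?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def should_fail_over(results: list[bool]) -> bool:
    """Apply the policy to the most recent monitor results."""
    recent = results[-FAIL_THRESHOLD:]
    return len(recent) == FAIL_THRESHOLD and not any(recent)
```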
Typical HA Architectures
2-Node Active–Passive Cluster
Basic pattern:
- Two identical nodes, shared storage or replicated storage.
- One node is primary, the other standby.
- Cluster monitors:
- Service health
- Node heartbeat
- On failure:
- Fences failed node
- Mounts storage (or promotes replica)
- Starts service + VIP on standby
Considerations:
- 2-node clusters require special quorum rules or a third quorum device to avoid split-brain.
- Fencing is critical because both nodes can see the shared storage.
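A conceptual sketch of the sequence the cluster manager automates in this pattern; every helper below is a hypothetical placeholder, and the point is the strict ordering:

```python
# Conceptual active-passive failover sequence; actions are placeholders
# for what fence agents and resource agents actually do.

FAILOVER_STEPS = [
    ("fence old primary",        lambda: print("power-cycle node1 via BMC")),
    ("take over shared storage", lambda: print("mount shared volume on node2")),
    ("start the service",        lambda: print("start database on node2")),
    ("move the VIP",             lambda: print("bring up 192.0.2.10 on node2")),
]

def fail_over() -> None:
    # Steps run strictly in order; a failure at any step aborts the rest,
    # which is safer than continuing with an unfenced peer or missing data.
    for name, action in FAILOVER_STEPS:
        print(f"-> {name}")
        action()
```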
N-Node Active–Active Web Tier
- Multiple web servers (stateless as much as possible).
- Fronted by:
- External load balancers, or
- Software load balancers on Linux.
- Failure of a node: remove it from load balancer; remaining nodes handle traffic.
Key design points:
- Sessions:
- Move state to shared storage (cache, DB) to avoid per-node stickiness, or
- Use sticky sessions at the load balancer.
- Scaling:
- Auto-scaling groups or manual addition of nodes.
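The mechanism behind “remove it from the load balancer” can be sketched in a few lines; the backend addresses are placeholders, and a real load balancer performs these checks natively:

```python
# Active-active pool: dispatch only to backends that pass a health check.
import socket

BACKENDS = ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"]  # placeholders

def healthy(backend: str) -> bool:
    """A node that fails its TCP check simply drops out of the pool."""
    host, port = backend.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=1.0):
            return True
    except OSError:
        return False

def current_pool() -> list[str]:
    """The set of backends the load balancer would dispatch to right now."""
    return [b for b in BACKENDS if healthy(b)]
```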
Database HA Patterns (Logical Overview)
Common archetypes:
- Shared storage failover:
- Single DB instance at a time on shared storage.
- Simple conceptual model; storage is a SPOF unless redundant.
- Replication-based HA (primary/replica):
- Primary node accepts writes.
- Replicas receive streamed changes.
- On primary failure, a replica is promoted.
- Multi-primary (active–active):
- More complex conflict resolution and consistency concerns.
- Typically requires application support and advanced DB features.
Clusters must align with the DB’s own replication/failover mechanisms and consistency guarantees.
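One place the RPO target shows up concretely in replication-based HA: before promoting a replica, check how far behind it is. A hedged sketch with placeholder lag values; in practice the lag would come from the database's own replication status.

```python
# Sketch: pick a replica to promote while respecting an RPO budget.

RPO_SECONDS = 60  # business requirement: lose at most 1 minute of writes

replicas = {
    "db2": {"lag_seconds": 4},
    "db3": {"lag_seconds": 130},   # too far behind for this RPO
}

def choose_promotion_candidate() -> str | None:
    eligible = {name: r for name, r in replicas.items()
                if r["lag_seconds"] <= RPO_SECONDS}
    if not eligible:
        return None  # fail safe: escalate to a human instead of violating the RPO
    # Prefer the most up-to-date replica.
    return min(eligible, key=lambda name: eligible[name]["lag_seconds"])

print(choose_promotion_candidate())  # db2
```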
Failure Modes and Design Tradeoffs
Types of Failures
- Node failure: server crash, kernel panic, hardware failure.
- Network failure:
- Node isolated
- Switch/router failures
- Asymmetric links (A sees B, B doesn’t see A).
- Storage failure:
- Disk, array, or networked storage problem.
- Service failure:
- Process crashed or hung
- Configuration errors
- Resource exhaustion (out of file descriptors, memory).
- Human error:
- Misconfiguration
- Accidental resource stop or deletion.
Each must be considered in testing and runbooks.
Split-Brain Scenarios
Split-brain occurs when:
- Cluster partitions into two (or more) groups.
- Each group believes it has quorum.
- Each group may try to run the same service using the same data.
Consequences:
- Data corruption or divergence
- Client confusion (multiple VIPs or conflicting DNS)
- Difficult manual recovery
Prevention/mitigation:
- Correct quorum configuration
- Robust fencing
- Tie-breakers (witness nodes/disks)
- A design that fails safe (stops services) when in doubt.
Consistency vs Availability (CAP Thinking)
While full CAP theorem discussion is beyond this chapter, you should recognize:
- Some systems choose to remain consistent at the cost of being unavailable during partitions.
- Others remain available and accept temporary inconsistency.
- For HA design, decide:
- Can you tolerate serving stale data?
- Is it better to reject requests than risk inconsistency?
Your cluster configuration (failover timing, fencing, write policies) must reflect this choice.
Designing an HA/Clustered Service
Requirements Gathering
Before picking tools:
- Define business requirements:
- Target availability (e.g., $99.9\%$ vs $99.99\%$)
- RTO and RPO
- Regulatory/compliance needs (data locality, encryption, audit).
- Understand workload:
- Read-heavy vs write-heavy
- Stateless vs stateful
- Latency sensitivity
- Budget and complexity tolerance:
- Operational expertise available
- On-call capabilities
- Hardware/licensing budget
Choosing the Right Pattern
Match requirements to patterns:
- Small database for internal app, strong consistency, low budget:
- 2-node active–passive with shared storage or simple primary/replica.
- High-traffic web API, mostly stateless:
- Horizontally scaled app servers behind load balancers.
- Mission-critical storage:
- Replicated/distributed storage cluster, possibly with geo-redundancy.
No single pattern fits all; many environments mix these patterns.
Dependencies and Cascading Failures
In clusters, a “service” often depends on multiple components:
- Database depends on:
- Storage
- Network
- Underlying OS
- Web app depends on:
- Application servers
- Databases/caches
- External services (SMTP, queues, payment gateways)
Mapping dependencies matters because:
- Restarting or failing over higher-level services may not fix lower-level problems.
- HA logic must respect ordering (e.g., start storage before DB, DB before app).
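Computing a safe start order from such a dependency map is essentially a topological sort, as the short illustrative sketch below shows (the service names are made up):

```python
# Derive a safe start order from a dependency map using the standard
# library's topological sort.
from graphlib import TopologicalSorter

# "X depends on Y" means Y must be started before X.
depends_on = {
    "webapp":   {"database", "cache"},
    "database": {"storage"},
    "cache":    set(),
    "storage":  set(),
}

start_order = list(TopologicalSorter(depends_on).static_order())
print(start_order)                  # e.g. ['cache', 'storage', 'database', 'webapp']
print(list(reversed(start_order)))  # a safe stop order
```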
Operational Aspects of Clustering and HA
Configuration Management and Consistency
Clusters are sensitive to drift:
- Different versions or configs on nodes can cause:
- Failed failovers
- Subtle bugs
- Use configuration management tools and/or infrastructure-as-code to:
- Ensure all cluster nodes stay in sync
- Version and review cluster changes
Testing Failover
You must test:
- Node failure:
- Physically power off a node, or simulate the failure from the hypervisor.
- Service failure:
- Kill the process, corrupt config, or block port to see response.
- Network partition:
- Disconnect one NIC or block routes.
- Storage failure:
- Make underlying volume temporarily unavailable.
Measure:
- Actual RTO
- Data loss (if any)
- Behavior of clients (time-outs, reconnection, retries).
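A hedged sketch for capturing the observed RTO during a drill: probe the service continuously and time how long it stays unreachable. The endpoint, port, and polling interval are placeholders.

```python
# Measure observed RTO during a failover drill: time from the first
# failed probe until the service answers again.
import socket
import time

HOST, PORT, INTERVAL = "192.0.2.10", 5432, 1.0  # example values

def is_up() -> bool:
    try:
        with socket.create_connection((HOST, PORT), timeout=1.0):
            return True
    except OSError:
        return False

def measure_rto() -> float:
    while is_up():                 # wait for the induced failure
        time.sleep(INTERVAL)
    outage_start = time.monotonic()
    while not is_up():             # wait for failover to complete
        time.sleep(INTERVAL)
    return time.monotonic() - outage_start

# print(f"observed RTO: {measure_rto():.1f}s")
```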
Observability and Alerting
Effective HA operations require:
- Metrics:
- Node health, resource states, failover counts, replication lag.
- Logs:
- Cluster events, fencing actions, service start/stop events.
- Alerts:
- Node down
- Resource failed to start/stop
- Quorum lost
- Fencing failures
Tie cluster logs and metrics into centralized monitoring and alerting systems.
Runbooks and Procedures
Document step-by-step:
- How to:
- Safely take a node out of service for maintenance
- Add/remove nodes from the cluster
- Trigger or prevent failover during planned work
- What to do during:
- Node failure
- Split-brain detection
- Storage errors
- Escalation procedures and ownership.
Runbooks reduce mistakes during emergencies and standardize responses.
Common Anti-Patterns and Pitfalls
- No fencing:
- Leads to split-brain and data corruption.
- 2-node clusters without proper quorum/witness:
- Frequent split-brain during transient network glitches.
- Treating storage as magically HA:
- Single SAN head or single NFS server becomes a massive SPOF.
- Overly aggressive failover:
- Failing over on transient issues causes “flapping” and instability.
- Complexity exceeding team skill level:
- A sophisticated geo-cluster run by a team with little operational experience often delivers lower real-world availability than a simpler design.
High-Level Lifecycle of a Clustered Service
A typical workflow from concept to production:
- Design:
- Requirements, RTO/RPO, failure modes, patterns.
- Lab/Prototype:
- Single or small test environment mirroring prod architecture.
- Automate:
- Configuration management, deployment scripts.
- Test:
- Functional tests
- Failure scenario simulations
- Production Rollout:
- Gradual ramp-up, pilot users, canary deployments where possible.
- Operate and Improve:
- Track incidents
- Refine configs, thresholds, runbooks
- Periodic failover drills (“game days”)
Understanding this lifecycle helps you integrate Pacemaker/Corosync clusters, distributed filesystems, and load balancers into a coherent HA strategy, rather than isolated technologies.