Introduction
Clustering and high availability describe how you design Linux systems so that services remain online even when hardware, software, or network components fail. At this level you are no longer thinking about a single server, but about groups of servers that cooperate to present one reliable service to users.
High availability, often abbreviated as HA, focuses on keeping services running with minimal downtime. Clustering is one of the main techniques used to achieve high availability, by using multiple machines that share workloads and can take over each other’s roles when needed.
In this chapter you will see the general goals, patterns, and vocabulary of clustering and high availability on Linux. Concrete tools such as Pacemaker, Corosync, distributed filesystems, and load balancers are covered in their own chapters. Here you will instead develop an understanding of why you use clusters, what kinds of clusters exist, and how Linux systems are typically arranged to provide resilient services.
High availability does not mean “no downtime ever.” It means designing systems so that planned and unplanned downtime are minimized, controlled, and predictable.
Availability and Reliability Fundamentals
When you design high availability on Linux, you must think in terms of probabilities instead of certainties. Every component can fail, but you want the overall service to remain available.
A common way to describe availability is as a percentage of uptime over a given period, usually a year. Availability $A$ is often simplified as:
$$A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
Here MTBF is mean time between failures, and MTTR is mean time to repair. In practice you will rarely calculate this exactly, but the formula is important conceptually. If you want higher availability, you must either increase the time between failures, reduce the time needed to recover, or both.
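The formula is easy to apply to rough estimates. A minimal sketch, using illustrative figures rather than measured ones:

```python
def availability(mtbf_hours, mttr_hours):
    """Availability as a fraction: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A node that fails about once a year (8760 h) and takes 4 h to repair:
print(f"{availability(8760, 4):.5f}")   # roughly 0.99954, a bit above three nines

# Halving the repair time roughly halves the unavailability:
print(f"{availability(8760, 2):.5f}")
```

Note that the second call improves availability without touching MTBF at all, which is why fast, automated recovery is often the cheapest path to more nines.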
Service levels are often described in “nines.” For instance:
2 nines: 99.0% availability is about 3.65 days of downtime per year.
3 nines: 99.9% availability is about 8.76 hours per year.
4 nines: 99.99% availability is about 52.6 minutes per year.
5 nines: 99.999% availability is about 5.26 minutes per year.
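The downtime figures above follow directly from the percentages. A short sketch to reproduce them:

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525600

def downtime_minutes_per_year(availability_pct):
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")
```

For example, 99.99% yields about 52.6 minutes per year, matching the list above.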
Each extra nine becomes significantly more expensive, since you must eliminate or mitigate more and more failure scenarios. On Linux, achieving something like four nines usually involves clustering, redundant hardware, and automated failover.
You cannot get true high availability by simply using “better hardware.” You must assume that every component, including the server, the network, storage, power, and your own configuration, will fail at some point.
Redundancy and Single Points of Failure
High availability is fundamentally about redundancy. If one path to your service fails, another path should be ready to take over. Linux servers in an HA environment usually have redundant components at multiple levels.
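The effect of redundancy can be quantified. For components that fail independently, availabilities multiply when components are in series (all must work) and unavailabilities multiply when they are in parallel (one is enough). A sketch, assuming independent failures, which real systems only approximate:

```python
def serial(*avail):
    """All components must be up: availabilities multiply."""
    a = 1.0
    for x in avail:
        a *= x
    return a

def parallel(*avail):
    """Any one component suffices: unavailabilities multiply."""
    u = 1.0
    for x in avail:
        u *= (1 - x)
    return 1 - u

# Two 99% nodes in parallel reach roughly four nines:
print(f"{parallel(0.99, 0.99):.4f}")

# ...while a chain of dependencies drags availability down:
print(f"{serial(0.999, 0.999, 0.999):.4f}")
```

This is also why a single point of failure dominates the whole design: any component in series caps the availability of everything behind it.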
A single point of failure, often abbreviated as SPOF, is any component whose failure will bring down the entire service. Examples include a single database server, a single network switch, or a single shared storage device without replication.
On Linux systems you typically reduce SPOFs through:
Redundant nodes, for example multiple web servers instead of one.
Redundant storage, using replication, mirroring, or distributed filesystems.
Redundant networking, such as multiple network interfaces with bonding or teaming and diverse switch connections.
Redundant power, with multiple power supplies and separate power circuits.
Clustering frameworks help remove software related SPOFs. If one Linux node fails or a critical process crashes, the cluster manager can restart the service on another node. Without clustering, you rely on manual intervention or simple local tools that cannot help if the entire machine is unavailable.
In design, it is important to list all possible SPOFs and decide whether to accept the risk or engineer redundancy. Not every SPOF must be eliminated. For some noncritical internal services, a short downtime might be acceptable. High availability should always match business and technical requirements, rather than aiming for maximal redundancy everywhere.
Types of Clusters
Clusters on Linux can be categorized according to their goals. Understanding these categories helps you choose the correct approach for a given workload. Different cluster types use different patterns for data access, failover, and client connectivity.
A high availability cluster focuses primarily on service continuity. It runs the same service on multiple nodes and uses failover if a node or resource fails. Typical use cases are databases, file servers, or application servers that must remain accessible even during individual node failures. Pacemaker and Corosync, which are covered in later chapters, are common components in these setups.
A load balancing cluster focuses on spreading client requests across multiple nodes. The goal is to increase capacity and performance rather than to protect stateful services. However, load balancing can also increase availability, since one node can be removed or fail without taking down the entire service. Linux load balancers such as HAProxy or Nginx are often placed in front of pools of application servers to form such clusters.
A storage cluster focuses on how data is stored and accessed across multiple machines. Distributed filesystems, object storage, and replicated block devices are typical examples. They aim to provide redundancy, and sometimes higher throughput, by striping, mirroring, or distributing data across several Linux nodes. High availability services that need shared data often build on top of storage clusters.
A compute or high performance cluster focuses on compute capacity for scientific or batch workloads. These clusters typically queue jobs and schedule them on multiple worker nodes. They do not necessarily offer persistent services in the traditional sense, so they are less about high availability and more about parallel computation.
In practice, real systems are often hybrids. A high availability setup for a web application might use a load balancing cluster at the front, a pool of application servers, and a storage cluster or replicated database at the back. Linux supports each layer with its own dedicated tools and frameworks, which is why subsequent chapters focus on specific technologies.
Active/Passive and Active/Active Architectures
For high availability services running on Linux, one of the central design decisions is whether the cluster will be active/passive or active/active.
In an active/passive configuration, one node provides the service to clients, and another node or nodes remain idle or underutilized, waiting to take over. Only the active node handles traffic for that service. If the active node fails, the passive node becomes active and starts the services or mounts the resources.
Active/passive clusters are conceptually simple. There is a single owner of each resource, such as a virtual IP address or a filesystem mount. The challenge is to transition ownership cleanly and quickly when a failure occurs. Pacemaker is commonly used to orchestrate this changeover, ensuring that resources are stopped on the failed node and started on the new node in a consistent order.
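The ordered changeover can be sketched abstractly. This is not how Pacemaker is implemented, just an illustration of why ordering matters: resources are stopped in reverse dependency order on the failed node and started in forward order on the new one. The resource names are hypothetical:

```python
# Hypothetical resources; each one depends on the ones before it.
RESOURCES = ["filesystem", "database", "virtual_ip"]

def fail_over(resources, old_node, new_node):
    """Return the ordered steps for moving a resource group between nodes."""
    steps = []
    # Stop in reverse order: virtual IP first, filesystem last.
    for r in reversed(resources):
        steps.append(f"stop {r} on {old_node}")
    # Start in forward order: filesystem first, virtual IP last,
    # so clients only arrive once the service behind it is ready.
    for r in resources:
        steps.append(f"start {r} on {new_node}")
    return steps

for step in fail_over(RESOURCES, "node1", "node2"):
    print(step)
```

Starting the virtual IP last is a common convention: it ensures the cluster advertises the service to clients only after storage and the application are up.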
In an active/active configuration, multiple nodes provide the same service simultaneously. Clients may be distributed across the nodes using a load balancer, DNS round-robin, or application logic. If one node fails, the remaining nodes continue serving requests, possibly at reduced capacity.
Active/active setups can be more efficient. You can use the capacity of all nodes all the time instead of leaving passive nodes idle. However, they are harder to design, especially for stateful applications. For example, a replicated database cluster must ensure that all nodes see consistent updates, or must be able to handle conflicts if updates arrive at different nodes concurrently.
Linux applications and middleware often determine which architecture is practical. Stateless HTTP servers are well suited to active/active clusters, while monolithic databases are often a better fit for active/passive setups, or for specialized multi primary replication modes that require careful tuning and monitoring.
Never design an active/active cluster unless the application and storage layers are explicitly built to support concurrent access from multiple nodes. Incorrect assumptions about shared state can silently corrupt data.
Core Components of a Linux HA Cluster
While details differ, high availability clusters on Linux typically share several common components. Understanding each component’s role clarifies what later tools such as Pacemaker and Corosync are trying to provide.
Cluster membership is the mechanism that lets nodes know which machines are part of the cluster and which are currently reachable. A node must be able to detect when another node has failed or is no longer able to participate. This requires communication channels and heartbeat mechanisms, which are discussed further in connection with Corosync in the next chapter.
A fencing mechanism, sometimes called STONITH which stands for “Shoot The Other Node In The Head,” is used to forcibly remove faulty or unresponsive nodes from the cluster. This can involve powering off a node through remote power control hardware, cutting its network connectivity, or revoking its storage access. Fencing is critical because it prevents split brain situations, where multiple nodes mistakenly believe they are the sole active node and can both write to shared resources.
A resource manager controls the services, IP addresses, storage mounts, and other resources running in the cluster. It knows which node should run which resource and in what order to start, stop, or move them. In the Linux HA ecosystem, Pacemaker plays this role and uses resource agents to start and stop applications in a consistent way.
Cluster communication is usually implemented through a message bus running over the network. Nodes exchange heartbeat messages that indicate they are alive and share updates about the state of resources. This communication layer must be resilient and support features such as encryption, authentication, and ordered message delivery.
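The heartbeat idea can be illustrated with a minimal sketch: each node records when it last heard from its peers, and any peer silent for longer than a timeout is considered failed. Real stacks such as Corosync add token passing, ordered delivery, and authentication on top of this basic idea; the timeout value below is purely illustrative:

```python
HEARTBEAT_TIMEOUT = 3.0  # seconds; illustrative, real stacks tune this carefully

class MembershipView:
    """Track which peers have been heard from recently."""

    def __init__(self, peers, now):
        # Assume every peer was just seen when the view is created.
        self.last_seen = {p: now for p in peers}

    def heartbeat(self, peer, now):
        """Record a heartbeat message received from a peer."""
        self.last_seen[peer] = now

    def alive_peers(self, now):
        """Peers heard from within the timeout window."""
        return {p for p, t in self.last_seen.items()
                if now - t < HEARTBEAT_TIMEOUT}

view = MembershipView(["node1", "node2", "node3"], now=0.0)
view.heartbeat("node2", now=4.0)           # only node2 keeps sending heartbeats
print(sorted(view.alive_peers(now=5.0)))   # -> ['node2']
```

Choosing the timeout is a trade-off: too short and brief network hiccups trigger false failovers, too long and real failures go undetected.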
Shared storage, or at least consistent storage, is another key component in many HA clusters. Even if you avoid a single shared disk, nodes must still see the same data. Techniques such as replicated block devices or distributed filesystems are used to provide a unified view of data. The choice of storage approach heavily influences the overall design of your cluster.
In a typical Linux HA setup, one or more communication rings transport cluster messages, Pacemaker manages resources, a quorum mechanism determines when the cluster is allowed to act, and fencing ensures that misbehaving nodes cannot damage shared state.
Quorum and Split Brain
Quorum is a concept used to ensure that only a set of nodes that represent a majority of the cluster can make decisions and run resources. It prevents situations where two disjoint groups of nodes both believe they are in charge. Without quorum, a network failure that separates nodes from each other can lead to both sides acting as independent clusters.
A simple example is a three node cluster. If one node loses network connectivity, the remaining two nodes can form a majority, retain quorum, and continue the service. The isolated node no longer has quorum and must not run any shared services. When connectivity is restored, it can rejoin the majority.
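Majority quorum is a simple calculation: a partition has quorum only if it can see strictly more than half of the configured nodes. A sketch:

```python
def has_quorum(visible_nodes, total_nodes):
    """True if this partition sees a strict majority of all configured nodes."""
    return visible_nodes > total_nodes / 2

# Three node cluster, one node isolated:
print(has_quorum(2, 3))   # the pair keeps quorum -> True
print(has_quorum(1, 3))   # the isolated node loses quorum -> False

# Even splits never have quorum, which is why two node
# clusters need an external tie breaker:
print(has_quorum(1, 2))   # -> False
```

The strict inequality is the key detail: with an even split, neither side has quorum, so neither side may run shared services.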
A split brain condition occurs when the cluster is divided into two or more groups that cannot communicate and both groups believe they own the same resources. If these resources include writeable storage, each side might apply conflicting updates. When the network heals, the data is inconsistent and can be very difficult to reconcile.
Linux clustering software uses quorum algorithms and fencing to prevent split brain. For example, if a node cannot see enough peers to maintain quorum, it will shut down its clustered services. Fencing can be used to ensure that nodes outside the quorum are disabled, so they cannot interfere with the surviving cluster.
Quorum policies vary depending on the number of nodes and the nature of the services. In a two node cluster, strict majority rules do not work, because each node alone would always represent half of the cluster. This is why two node clusters often use additional techniques such as quorum devices or arbitrators, which act as tie breakers.
Never ignore quorum warnings or disable fencing merely to “make the cluster work.” Doing so may produce apparent stability in the short term but greatly increases the risk of catastrophic data loss in a real failure.
Service Addressing and Client Access
High availability only matters if clients can reach the service reliably. From the perspective of clients, the cluster should look like a single, stable endpoint, regardless of which actual Linux node is currently serving the requests.
A common pattern uses a virtual IP address that can move between nodes. The cluster manager assigns the IP to the active node and updates it when failover occurs. Clients connect to the virtual IP, and the underlying node change is invisible to them as long as the failover is quick.
Another pattern uses load balancers to present a single address and distribute connections across multiple backend nodes. The load balancer itself can be made highly available using redundant instances with their own virtual IPs or BGP based routing advertisements.
DNS can also play a role in directing traffic to multiple cluster nodes. For example, a DNS name may resolve to several IP addresses, and client resolvers might use them in turn or randomly. This is typically slower to react to failures, since DNS records have time to live values, and many clients cache results. DNS based strategies are more appropriate for applications that can handle occasional connection failures or for geographically distributed clusters.
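Client-side behavior with multiple DNS records can be sketched as trying each returned address in turn. The addresses and the probe function below are hypothetical; real resolvers and applications differ in ordering, timeouts, and caching:

```python
# Hypothetical addresses a DNS name resolved to.
ADDRESSES = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]

def connect_with_fallback(addresses, try_connect):
    """Try each resolved address until one accepts the connection."""
    for addr in addresses:
        if try_connect(addr):
            return addr
    raise ConnectionError("no address reachable")

# Simulate the first backend being down:
down = {"192.0.2.10"}
chosen = connect_with_fallback(ADDRESSES, lambda addr: addr not in down)
print(chosen)   # -> 192.0.2.11
```

Even this fallback only helps clients that actually retry; clients that cache a single stale address must wait for the DNS time to live to expire.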
Session handling is another important aspect. Stateful applications might require session persistence, meaning that the same client is consistently directed to the same backend node for the duration of a session. Linux based load balancers and proxies provide mechanisms such as cookie based or IP based stickiness for this purpose.
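IP based stickiness can be illustrated by hashing the client address onto a fixed pool of backends, so the same client lands on the same node as long as the pool is unchanged. A minimal sketch with hypothetical backend names:

```python
import hashlib

BACKENDS = ["app1", "app2", "app3"]   # hypothetical backend pool

def sticky_backend(client_ip, backends):
    """Deterministically map a client IP to a backend."""
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]

# The same client always lands on the same backend:
first = sticky_backend("203.0.113.7", BACKENDS)
second = sticky_backend("203.0.113.7", BACKENDS)
print(first == second)   # -> True
```

A caveat of this naive modulo scheme: changing the pool size remaps most clients at once, which is why production load balancers often use consistent hashing or explicit stick tables instead.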
The choice of service addressing strategy directly affects how you configure the Linux networking stack. Virtual IPs, bonding, VLANs, and routing policies are often involved. The goal is always the same: minimize client facing changes and preserve a stable entry point, even as the internal cluster components change over time.
Testing and Validating High Availability
A cluster that looks fine when idle may behave unexpectedly during real failures. High availability design on Linux must include deliberate, repeatable testing. You must verify that failover works as expected, that data remains consistent, and that monitoring detects the right signals.
Testing begins with planned failover. For example, you ask the cluster manager to move a resource or stop a node gracefully and observe how quickly and cleanly services recover. This confirms that resource dependencies and ordering are configured correctly.
Next, you simulate unplanned failures. You might power off a node abruptly, pull a network cable, or force a service to crash. These tests help you validate that heartbeat mechanisms and fencing react properly. They also reveal subtle race conditions, such as services starting before storage is fully available.
You should also test degraded performance scenarios. For instance, a partially failing disk or a flapping network link might cause intermittent failures rather than clear outages. Monitoring software should be configured to detect and escalate these issues before they cause full service failures or data corruption.
Documentation is part of validation. Each tested scenario should have a described procedure and an expected outcome. Over time, as you update Linux distributions, kernels, or cluster software, you can repeat the same tests to ensure that behavior remains stable.
Never deploy a new high availability configuration to production without having performed controlled failover and failure simulations in a test environment that closely matches your real systems.
Operational Considerations and Trade-offs
High availability introduces complexity. Each additional node, network path, and storage layer adds more moving parts that must be maintained, monitored, and understood. This complexity is itself a risk, especially if operations staff are unfamiliar with clustering concepts or tools.
Patch management becomes more involved. Applying security updates to a highly available Linux cluster usually requires rolling upgrades, where you move resources off one node, patch it, bring it back, and repeat the process for other nodes. You must ensure that version differences between nodes are supported by the clustering software and by any replicated services.
Backup and restore strategies must be compatible with clustering. For example, you need to know whether backups should run from a specific node to avoid loading down active nodes, and how restores affect replicated storage. Backups themselves do not provide availability, but they are essential for recovering from data corruption, which HA alone cannot prevent.
Monitoring and alerting are also more complex. You must watch both individual nodes and cluster level resources. A node may appear healthy from a system level perspective, yet be unable to participate in the cluster due to quorum loss or fencing. Cluster aware monitoring tools or plugins can help interpret these states correctly.
Finally, not every service justifies the cost and effort of clustering. Some internal tools or batch workloads may tolerate a few hours of downtime a year. For those, simpler approaches such as regular backups and a tested manual recovery plan might be sufficient.
Careful analysis of requirements, risk tolerance, and operational capacity is necessary before you commit to high availability designs. Linux provides powerful building blocks for clustering, but they should be applied where they provide clear value.
Summary
Clustering and high availability shift your focus from single Linux servers to groups of cooperating nodes that collectively provide resilient services. You learned the basic concepts of availability, redundancy, and single points of failure, as well as the major types of clusters and the distinction between active/passive and active/active designs.
You also saw the core components that make high availability clusters work, including membership, quorum, fencing, and shared or coordinated storage. Client access techniques such as virtual IPs and load balancing provide stable endpoints despite internal failover, while testing and operational practices ensure that the cluster behaves correctly during real failures.
The remaining chapters in this part of the course will build on these foundations. You will examine specific Linux tools such as Pacemaker and Corosync, learn concrete failover techniques, and explore distributed storage systems that complete the high availability picture.