Understanding Failover
High availability on Linux servers relies on one central idea: if one component stops working, another should take over automatically, with as little disruption as possible. This automatic handover is what administrators call failover. In a failover design, you expect that things will break, and you build systems that can survive those failures without a long outage and without losing data.
Failover is not the same thing as simple redundancy. Redundancy provides extra capacity or extra copies, but failover adds detection, decision making, and coordinated switching. A spare server that is never used unless an administrator logs in and reconfigures DNS manually is redundant, but it does not provide automatic failover. A high availability cluster connects redundant components with monitoring, a shared view of state, and predefined actions so that services can move between nodes without manual intervention.
A true failover system must detect failures, decide which node should take over, and then start services on that node while preserving data consistency and avoiding split brain.
Types of Failover Architectures
When you design failover, you must choose how the nodes in the cluster will be used during normal operation and during a failure. These patterns have different trade‑offs in cost, complexity, and performance.
In an active passive architecture, one node runs the service during normal operation and another node stays ready to take over. The passive node runs the clustering stack, monitors the active node, and usually has full access to the same data, but it does not serve users. When the active node fails, the cluster promotes the passive node to active and starts services on it. This pattern is simple to understand and can be very predictable, but it leaves some hardware idle.
In an active active architecture, two or more nodes run the same service at the same time and share the load. If one node fails, the remaining nodes keep serving users, often after redistributing traffic. This is common with stateless services such as web frontends, or with stateful services that support clustering, such as some database engines or distributed storage. Active active designs can give both high availability and better performance, but they require careful handling of shared state and are more complex to test and operate.
A variant of active passive is active standby, sometimes used when the standby node also runs other, lower priority services or belongs to a different cluster. In that case, the standby node is not idle, but the high availability role has priority and can preempt other workloads when failover occurs.
There are also N plus 1 and N plus M failover designs, where a pool of standby nodes can take over for any of several active nodes. For example, three active application servers might share one warm standby server that can replace whichever fails first. This reduces cost compared to one standby per active node, but capacity planning becomes more important because you must ensure that enough standby resources exist to handle multiple failures.
Failure Detection
Failover cannot occur until the cluster knows that something is wrong. Detection is usually done with health checks and timeouts. Nodes monitor each other and sometimes also monitor specific services or shared resources.
At the cluster level, nodes exchange heartbeat messages. A heartbeat can be a small packet sent over a dedicated network link or across a general network. If node A stops receiving heartbeats from node B for a configured timeout, node A starts to suspect that node B has failed. At the same time, there might be a cluster quorum mechanism that checks whether a majority of nodes agree on the current state to avoid conflicting decisions.
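The heartbeat-and-timeout logic described above can be sketched in Python. This is a minimal illustration, not a real cluster stack; the node names and the timeout value are hypothetical:

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before a peer is suspected

class PeerMonitor:
    """Tracks the last heartbeat seen from each peer node."""

    def __init__(self, peers):
        now = time.monotonic()
        self.last_seen = {peer: now for peer in peers}

    def record_heartbeat(self, peer):
        # Called whenever a heartbeat packet arrives from a peer.
        self.last_seen[peer] = time.monotonic()

    def suspected_peers(self):
        # A peer is suspected failed if no heartbeat arrived within the timeout.
        now = time.monotonic()
        return [p for p, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

monitor = PeerMonitor(["node-b", "node-c"])
monitor.record_heartbeat("node-b")
# If node-c stops sending heartbeats, it will eventually appear in
# monitor.suspected_peers() once the timeout elapses.
```

Real cluster stacks layer retransmission, multiple links, and quorum on top of this, but the core decision is the same comparison of elapsed time against a configured timeout.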
At the service level, the cluster does not just check whether a node is alive. It also checks whether the service itself is healthy. This can involve checking if a process is running, if a TCP port is open, or if a more complex test such as an HTTP request returns success. A node can be alive but unable to serve requests correctly, for example when a database daemon is stuck. Proper failover design monitors the real end user behavior of the service, not just low level daemons.
Timeouts are critical here. If the timeout is too short, temporary delays can trigger unnecessary failovers. If it is too long, users experience long outages before failover begins. Choosing and testing realistic timeout values is part of tuning a high availability system.
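The lowest layer of such checks, a TCP port probe with an explicit timeout, can be sketched as follows; the timeout value is illustrative and must be tuned against the real service:

```python
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Liveness check: can we open a TCP connection within the timeout?

    A short timeout keeps detection fast, but a value tuned too low will
    flag a merely slow service as dead and trigger unnecessary failovers.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A port check only proves a daemon is accepting connections; a stuck
# database can still pass it, which is why deeper checks (an HTTP health
# URL, a test query) are layered on top in practice.
```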
Failure detection must distinguish between a node that is slow and a node that is truly unreachable, or you risk unnecessary failovers and possible split brain.
Failover Triggers and Policies
The cluster must decide when a failure is bad enough to justify a failover. These decisions follow a policy that you define in advance. Policies usually describe which resources run on which nodes under normal conditions, what should happen on specific failure events, and how many times to try recovery before moving a service.
One common trigger is node failure. If a node stops responding to heartbeats and health checks, the cluster marks it as down and can move all its managed resources to another node. Another trigger is resource failure. If a service fails a health check or exits unexpectedly, the cluster might try to restart it on the same node, and if that fails too many times, it will attempt to move it to a different node.
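The restart-then-migrate escalation can be expressed as a small policy object. This is a sketch of the idea (comparable to a migration threshold in cluster managers); the threshold and resource names are hypothetical:

```python
MAX_LOCAL_RESTARTS = 3  # illustrative threshold

class ResourcePolicy:
    """Decides whether a failed resource is restarted in place or moved."""

    def __init__(self, max_local_restarts=MAX_LOCAL_RESTARTS):
        self.max_local_restarts = max_local_restarts
        self.failures = {}  # resource name -> consecutive failures on this node

    def on_failure(self, resource):
        count = self.failures.get(resource, 0) + 1
        self.failures[resource] = count
        if count <= self.max_local_restarts:
            return "restart"   # try again on the same node
        self.failures[resource] = 0
        return "migrate"       # give up locally, move to another node

policy = ResourcePolicy()
# Three consecutive failures of "webserver" yield local restarts;
# the fourth escalates to a migration.
```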
Policies also control resource ordering and colocation. For example, a database must start before an application that depends on it, and both might need to run on the same node. The cluster configuration describes these relationships, so that when failover occurs, the services start and stop in a safe order.
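With Pacemaker's pcs tool, for example, ordering and colocation relationships like these are declared as constraints. The resource names below are hypothetical, and exact syntax varies between versions:

```shell
# Start the database before the application; stopping happens in reverse order.
pcs constraint order start database then start webapp

# Keep the application on the same node as the database.
pcs constraint colocation add webapp with database INFINITY
```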
There is also the question of automatic versus manual failback. After a node recovers from a failure, you can choose to let the cluster move services back to their original preferred node automatically, or you can keep them on the node that took over until an administrator decides to move them. Automatic failback can restore the intended load distribution but may cause extra movement of services, which is risky if the original problem was not truly fixed.
Data Consistency and Shared Storage
For stateful services, successful failover is not only about restarting a process but also about preserving current data and avoiding corruption. Data can be shared through a network filesystem, distributed storage, or synchronous replication built into the application.
In a typical active passive design with shared storage, both nodes have access to the same block device or filesystem, but only one node at a time is allowed to write to it. The cluster uses fencing or resource locking to ensure this rule. When failover occurs, the cluster makes the new node the only one allowed to access the storage, usually by unmounting the filesystem from the failed node, or forcibly cutting off its access, and then mounting it on the new node.
In active active designs, each node might maintain its own copy of data, replicated across nodes. Replication can be synchronous or asynchronous. Synchronous replication waits for data to be written on multiple nodes before confirming success to the client. Asynchronous replication writes locally first and ships changes to other nodes later. Synchronous replication improves data consistency at the cost of higher latency, while asynchronous replication can lose the last few writes if a failure happens at the wrong time. The right choice depends on the application and its tolerance for data loss.
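The difference between the two modes can be shown with a toy replication sketch; the class and function names are hypothetical and stand in for a real replication layer:

```python
class Replica:
    def __init__(self):
        self.data = []

    def apply(self, record):
        self.data.append(record)

def write_sync(primary, replicas, record):
    """Synchronous replication: confirm only after every replica has the record."""
    primary.apply(record)
    for r in replicas:
        r.apply(record)           # blocks until each replica confirms
    return "committed"            # no replica can be missing this record

def write_async(primary, replicas, record, ship_queue):
    """Asynchronous replication: confirm immediately, ship changes later."""
    primary.apply(record)
    ship_queue.append((replicas, record))  # shipped by a background task
    return "committed"            # replicas may lag; a crash now loses this record
```

The latency cost of the synchronous loop and the data-loss window of the asynchronous queue are exactly the trade-off described above.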
For any failover design that touches persistent data, you must decide in advance whether you prefer availability with possible data loss or stronger consistency with possible downtime. You cannot maximize both at the same time.
Split Brain and Quorum
Split brain is one of the most dangerous failure modes in clusters. It happens when two or more parts of a cluster lose communication with each other but remain individually alive. Each side believes the others are dead and may try to take over the same resources. If both sides write independently to the same data, you can end up with diverging and conflicting states that are very hard to reconcile.
To prevent split brain, clusters use quorum mechanisms. In a quorum system, decisions require agreement from more than half of the voting members. If a network partition splits a cluster into two equal parts, neither side has a majority and they must stop managing shared resources. If one side has a majority, that side can continue to serve the application, and the smaller side must stand down.
Sometimes a cluster uses a quorum device or tie breaker, such as a shared disk or an extra small node, to avoid ties in a two node cluster. When connectivity is lost, nodes check which side can still reach the quorum device, and only that side continues serving.
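The majority rule itself is a one-line check; the surrounding comments illustrate the partition scenarios described above:

```python
def has_quorum(reachable_votes, total_votes):
    """A partition may keep managing resources only with a strict majority."""
    return reachable_votes > total_votes // 2

# A 5-node cluster split 3/2: only the 3-node side keeps quorum.
# A 4-node cluster split 2/2: neither side has quorum, so both must stand
# down unless a tie breaker (quorum device) grants one side an extra vote.
```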
Another tool to avoid split brain is fencing. Fencing removes a failed or suspected node from the cluster by force so that it cannot access shared resources any more. This can be done by cutting power, disabling network access, revoking storage access, or other low level actions. Only after the cluster is sure that the old node is fenced and inactive does it allow another node to take over resources.
Never permit two independent nodes to think they are both the active owner of the same writable data. Use quorum and fencing to enforce a single writer at all times.
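The fence-before-takeover ordering can be written as a guard in the failover sequence. This is a sketch; fence_node stands in for whatever real fencing agent (power switch, storage revocation) the cluster uses:

```python
def failover(resource, new_node, old_node, fence_node):
    """Take over a resource only after the old owner is confirmed fenced.

    fence_node is a callable standing in for a real fencing agent; it must
    return True only when the old node is verifiably cut off from shared
    resources.
    """
    if not fence_node(old_node):
        # Fencing failed or is unconfirmed: taking over now could allow two
        # writers on the same data, so the only safe action is to refuse.
        raise RuntimeError(f"cannot confirm {old_node} is fenced; aborting takeover")
    return f"{resource} started on {new_node}"
```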
Failover Time and Recovery Objectives
For every system, you should define how much downtime and data loss is acceptable. These definitions are often expressed as Recovery Time Objective and Recovery Point Objective. Although the detailed use of these metrics belongs to broader disaster recovery planning, they are central to failover design.
Recovery Time Objective, often abbreviated as RTO, describes how quickly a service must be restored after a failure. If your RTO is 60 seconds, your entire detection, decision, and failover process must usually complete within one minute. This includes health check intervals, timeout values, and the time needed to start the service on another node.
Recovery Point Objective, often abbreviated as RPO, describes how much data loss, measured in time, is acceptable. An RPO of zero means that the system should not lose any committed data at all, which typically requires synchronous replication and careful application design. An RPO of five minutes means that losing the last five minutes of updates is acceptable after a disaster.
From a simple point of view, you can think of these as inequalities. If $T_{failover}$ is the time it takes from failure to full service restoration, then you want
$$
T_{failover} \leq RTO
$$
Similarly, if $T_{data\_loss}$ is the time difference between the last consistent backup or replica and the failure event, you want
$$
T_{data\_loss} \leq RPO
$$
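Because $T_{failover}$ is composed of tunables you control, you can budget it explicitly. The parameter values below are illustrative:

```python
def failover_time_budget(heartbeat_interval, missed_heartbeats,
                         fence_seconds, service_start_seconds):
    """Rough worst-case failover time from the tunables that compose it.

    Detection waits for several missed heartbeats, then the failed node is
    fenced, then the service starts on the surviving node.
    """
    detection = heartbeat_interval * missed_heartbeats
    return detection + fence_seconds + service_start_seconds

# With a 2 s heartbeat, 3 missed beats, 10 s fencing and 20 s startup,
# the budget is 36 s: it fits a 60 s RTO but not a 30 s one.
budget = failover_time_budget(2.0, 3, 10.0, 20.0)
```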
Shorter RTO and RPO values require more complex clustering, higher quality hardware, more frequent or synchronous replication, and more careful testing. In other words, high availability is a trade-off between cost, complexity, and risk.
Health Checks and Graceful Degradation
Good failover design includes robust health checking at several layers. Checking only whether a TCP port is open is not enough for many applications. You often need end to end checks that verify that the system behaves correctly from the user perspective. For a web service, this might mean an HTTP request to an internal health URL that verifies database connectivity and other dependencies.
Health checks should fail fast when something is wrong but also tolerate minor or transient slowdowns. If every small spike in latency triggers failover, the system can enter a loop of unnecessary restarts that harms reliability rather than improving it. Many high availability setups define several levels of health status, such as healthy, degraded, and failed, and they treat them differently.
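A three-level classification like that might be sketched as follows; the latency thresholds are hypothetical and must come from testing the actual service:

```python
# Illustrative thresholds; real values come from measuring the service.
LATENCY_DEGRADED_MS = 500
LATENCY_FAILED_MS = 5000

def classify_health(ok, latency_ms):
    """Map a check result to one of three levels instead of a binary verdict."""
    if not ok or latency_ms >= LATENCY_FAILED_MS:
        return "failed"      # eligible for failover
    if latency_ms >= LATENCY_DEGRADED_MS:
        return "degraded"    # alert and watch, but do not fail over yet
    return "healthy"
```

Treating "degraded" as a distinct state is what prevents every latency spike from triggering a full failover.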
Graceful degradation is a related concept. Instead of completely failing when a component is overloaded or partly broken, the system might reduce features or accept fewer requests. For example, a web application might disable a nonessential feature while the database cluster is recovering. Failover principles still apply, but not every partial problem has to trigger a full node level failover if an application can handle it more smoothly.
Failback and Stabilization
Once a failure is resolved, the cluster must decide whether to move services back to their original nodes. This process is called failback. Automatic failback can be convenient because it restores the original design where specific nodes carry specific loads. However, it introduces additional transitions, and every transition is a potential risk of another short outage.
For some systems, the safest approach is manual failback. An administrator checks that the repaired node is stable, verifies logs, and then schedules a controlled migration of services at a low traffic time. This reduces surprises but requires operational discipline and clear procedures.
During failback, you must pay attention to data resynchronization. In replicated systems, the node that was offline must catch up with changes that occurred while it was away. Until that synchronization is complete, it should not become the primary source of truth. If failback occurs too soon, clients might see outdated data.
Testing and Simulation of Failover
A failover design is only as good as its behavior in real failures. It is not enough to configure clustering software and trust that it will work. Regular testing and simulation are essential parts of failover principles.
Testing should include simple node shutdowns, abrupt power offs, network partitions, storage failures, and process crashes. Each test should measure actual failover time and verify that users see at most the level of disruption that you designed for. Tests also reveal configuration mistakes such as incorrect dependencies, misconfigured timeouts, and missing fencing methods.
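Measuring actual failover time during such a test can be as simple as polling the service from a client's perspective. This is a sketch; probe stands in for any real check, such as an HTTP health request, and the intervals are illustrative:

```python
import time

def measure_downtime(probe, poll_interval=0.5, max_wait=120.0):
    """Measure observed outage length by polling a service during a failover test.

    probe is a zero-argument callable returning True when the service
    answers correctly from the user's point of view.
    """
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        if probe():
            return time.monotonic() - start  # seconds of observed outage
        time.sleep(poll_interval)
    return None  # failover did not complete within max_wait
```

Run it from outside the cluster, kill the active node, and compare the returned value against your RTO.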
In addition to planned tests, some organizations use fault injection techniques where failures are introduced during normal operation, in a controlled way, to validate that the system is resilient. This approach is sometimes called chaos testing. It must be done carefully, especially in production environments, but it reflects the principle that you should not rely on untested assumptions about how failover behaves.
Never deploy a failover configuration into production without observing at least one full failover and recovery cycle in a safe environment that mimics real workloads as closely as possible.
Design Trade‑offs and Practical Considerations
Each decision in a failover design has side effects. Using more nodes increases redundancy but also increases the complexity of quorum and split brain prevention. Shorter detection intervals reduce downtime but increase the risk of false positives. Stronger consistency mechanisms reduce data loss but can decrease performance.
Capacity planning is also part of failover principles. If one node in a two node active active cluster fails, the remaining node must carry all the load. That means the normal combined load should be less than what a single node can handle. In symbolic form, if each node can handle capacity $C$ and the normal total load is $L$, then for a two node cluster you want
$$
L \leq C
$$
not $L \leq 2C$, because failover reduces your capacity to a single node.
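The same check generalizes to larger clusters: to survive a given number of simultaneous failures, the normal load must fit on the surviving nodes. A sketch, with illustrative numbers:

```python
def survives_failures(total_load, node_capacity, nodes, failures):
    """Check that the cluster still carries the load after `failures` nodes die."""
    surviving_capacity = (nodes - failures) * node_capacity
    return total_load <= surviving_capacity

# Two-node active active cluster, each node handling 100 units:
# a normal load of 90 survives one failure, a load of 150 does not.
```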
Operational simplicity often improves real availability more than theoretical maximum performance. A configuration that your team understands and can troubleshoot rapidly will withstand real incidents better than a finely tuned but fragile design. Good documentation, clear runbooks, and monitoring that highlights failover events are all part of an effective failover strategy.
By grounding your cluster design in these failover principles (detection, controlled decision making, data safety, split brain prevention, and realistic testing) you create Linux based services that can remain available even when individual components fail.