5.6.2 Pacemaker

Introduction

Pacemaker is a cluster resource manager. It does not provide storage, networking, or services by itself. Instead, it decides which node in a cluster should run which resource, and when to stop, start, or move those resources to keep them available. In a high availability stack it usually sits above Corosync, which provides cluster messaging and quorum, and below the actual services that you want to keep highly available, such as IP addresses, file systems, or databases.

This chapter focuses on what is specific to Pacemaker: how it models the cluster, how it makes decisions, which components it uses internally, and how you interact with it in daily administration. Concepts that belong to general clustering or to Corosync are assumed, not repeated here.

Pacemaker Architecture Overview

Pacemaker is split into several cooperating daemons. Together they maintain a consistent view of nodes and resources, and converge on a single cluster-wide decision about what should be running where.

At the center is the Cluster Information Base, or CIB. This is an XML document that describes the cluster configuration and status. It contains definitions of resources and constraints, and also the current state of nodes and resources as reported by local agents. All Pacemaker decisions are based on the CIB. Configuration tools like pcs or crm modify the CIB, not each node individually.
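
For a concrete sense of what the CIB looks like, you can dump it on any cluster node. The commands below are only a sketch; they assume a working pcs installation and merely read the configuration:

    pcs cluster cib                      # print the full CIB as XML
    cibadmin --query --scope resources   # show only the resource definitions section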

The policy engine reads the CIB and computes the desired state of the cluster. It takes into account node availability, resource definitions, timeouts, constraints, and current failures. From this information it creates an ordered list of actions that must be executed to move the cluster from its current state to the target state. Each recalculation is atomic and results in a so-called transition graph.

A separate controller daemon receives this graph and coordinates the actions across the cluster. It sends start, stop, and monitor requests to the appropriate nodes and waits for the results. If an action fails, the controller may ask the policy engine to recalculate based on the new information.

On each node, the Local Resource Manager handles the actual running and monitoring of resources. It interacts with resource agents that know how to start and stop specific services. Pacemaker itself never calls systemctl or database commands directly. Instead it calls standardized agents, usually in the OCF (Open Cluster Framework) format, which implement operations like start, stop, and monitor.

Pacemaker also includes fencing integration. It does not perform power operations itself, but coordinates stonith agents to power off or reset nodes that are misbehaving. This integration is essential because Pacemaker makes scheduling decisions assuming that a fenced node is completely removed from the cluster.

Core Concepts: Nodes, Resources, and Roles

Pacemaker thinks in terms of nodes and resources. A node is any machine that participates in the cluster. Pacemaker tracks node state, such as whether a node is online or in standby, as well as custom node attributes, for example a label describing the node's role or location. Nodes can be assigned scores that influence where resources are allowed or preferred to run.

Resources are the cluster-managed entities. A resource might represent a virtual IP address, a file system, a database instance, or any other service that can be controlled through a resource agent. Pacemaker itself remains agnostic about the internal details of a resource. It only knows the interface that the agent exposes.

Each resource has a role. The most common roles are Started and Stopped. For multi-state resources, typically database clusters or replication setups, Pacemaker also distinguishes between Master and Slave (called Promoted and Unpromoted in recent Pacemaker releases). The resource agent advertises that it is capable of running in multiple roles, and Pacemaker uses constraints to decide which node, if any, should act as the master at any point in time.

Resource states are derived from monitor operations performed by the resource agents. Pacemaker collects these reports into the CIB and uses them in its decision process. If a resource is reported as failed, Pacemaker may attempt recovery on the same node or move it to another node, depending on resource configuration and failure policies.

Resource Agents and Operations

Pacemaker relies on resource agents to control services. The most commonly used class is OCF agents, located under directories like /usr/lib/ocf/resource.d. These agents follow a standard interface: they implement operations such as start, stop, monitor, promote, and demote and advertise metadata describing available parameters and behaviors.

When you define a resource in Pacemaker, you specify which agent to use and which parameters to pass to it. For example, a virtual IP resource uses an IPaddr2 agent and requires address, netmask, and possibly interface parameters. A database resource uses a specialized agent that knows how to start and test the database safely.
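
As a hedged example, a virtual IP resource of the kind described above could be created with pcs roughly as follows. The resource name "vip", the address, and the interface are made up for illustration:

    pcs resource create vip ocf:heartbeat:IPaddr2 \
        ip=192.0.2.10 cidr_netmask=24 nic=eth0 \
        op monitor interval=30s timeout=20s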

Operations are the actions that Pacemaker may call on resource agents. Common operations include:

start, which makes the resource active on a node.

stop, which cleanly deactivates it.

monitor, which tests whether the resource is healthy while it is supposed to be running.

promote and demote, which switch a multi-state resource between slave and master roles.

Each operation has a timeout and, optionally, an interval. A monitor operation with a nonzero interval repeats periodically. With each successful or failed operation the resource agent returns a code that Pacemaker interprets as success, transient failure, permanent failure, or other special situations like "not running". Pacemaker uses these codes to decide when to fail over or to mark a node or resource as faulty.

Careful tuning of timeouts and monitoring intervals is important. Timeouts that are too short can cause false failures on busy systems, which leads to unnecessary recovery actions. Intervals that are too long increase the time to detect real failures. These are Pacemaker specific considerations, separate from general service configuration.
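
As a sketch of such tuning, the monitor operation of the hypothetical "vip" resource from the earlier example could be adjusted with pcs along these lines:

    # repeat the health check every 10 seconds, allow it 20 seconds to finish
    pcs resource update vip op monitor interval=10s timeout=20s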

Constraints and the Scoring Model

Pacemaker uses a scoring system to decide the best placement of resources. Every potential placement of a resource on a node receives a total score. Pacemaker then chooses placements with the highest allowed score, as long as no constraint is violated. Scores are integer values. Special constants such as INFINITY are used to denote absolute rules that cannot be overridden by smaller scores.

Several types of constraints contribute to these scores. Location constraints affect where a resource may or should run. A positive score for a resource on a node expresses a preference for that node. A negative score discourages or forbids that placement. For example, if a node has a particular hardware device, you might give related resources a positive score for that node.

Colocation constraints describe which resources should run together or apart. A positive colocation score means that if one resource is running on a node, the other resource is strongly preferred or required to run on the same node. A negative colocation score expresses a desire to separate them, for example to avoid placing two heavy database instances on the same machine.

Order constraints describe in which order resources should start and stop. Pacemaker does not simply start everything at once. It respects ordering rules, such as "start the file system before starting the database" or "stop the application before stopping the file system." These constraints are important for correctness and to avoid data corruption.
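
The three constraint types could be expressed with pcs roughly as shown below; the resource names (vip, database, database2, shared-fs) and node names are hypothetical, and the exact syntax may differ slightly between pcs versions:

    # location: prefer node1, never run on node3
    pcs constraint location database prefers node1=100
    pcs constraint location database avoids node3

    # colocation: keep the virtual IP with the database, keep two databases apart
    pcs constraint colocation add vip with database score=INFINITY
    pcs constraint colocation add database2 with database score=-1000

    # order: mount the file system before starting the database
    pcs constraint order start shared-fs then start database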

All these constraints, with their scores and dependencies, are fed into the policy engine. For each recalculation the engine computes, for each resource and node, a suitability score. The final placement is the one where resources are started on nodes with the highest valid scores, and where no hard rules are violated.

A resource assignment that results in a score of -INFINITY is absolutely forbidden. Pacemaker will never place that resource on that node as long as this score applies.

Pacemaker also supports more advanced constraint features such as rule-based expressions using time or node attributes. These allow you to express policies like "run this resource on node A during work hours, but allow node B at night," entirely within Pacemaker's scoring framework.
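
A minimal sketch of such a rule, using a made-up node attribute called "datacenter", might look like this with pcs (rule syntax varies somewhat between versions):

    pcs node attribute node1 datacenter=east
    pcs constraint location database rule score=500 datacenter eq east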

Resource Stickiness and Migration

Pacemaker tracks where a resource is currently running and by default avoids moving it unnecessarily. This property is called stickiness. Each resource can have a stickiness score that is added to the node where it is currently active. As long as the current node is healthy, this stickiness score must be overcome by other constraints before Pacemaker will migrate the resource.

Stickiness is important because moving services causes interruptions, even if brief. If you prefer stable placements to constant rebalancing, you set a higher stickiness value. In contrast, if you want Pacemaker to optimally rebalance after every small change, you can lower or disable stickiness.
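
Stickiness can be set as a cluster-wide default or per resource. The values below are arbitrary examples:

    pcs resource defaults resource-stickiness=100   # newer pcs releases prefer "pcs resource defaults update"
    pcs resource meta vip resource-stickiness=200   # per-resource override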

In addition to stickiness, Pacemaker supports manual resource migration. You can request that a resource be moved from one node to another with a single command. Internally, Pacemaker applies a temporary constraint that sets a very negative score for running that resource on the old node. The next policy calculation then chooses a different node, and Pacemaker performs an orderly stop on the old node and start on the new one, respecting order and colocation constraints.

There is also the concept of resource meta attributes such as "target role." Setting the target role to Stopped for a resource tells Pacemaker that it should not run this resource anywhere, even if constraints would normally favor some node. Changing the target role back to Started returns control to the scheduler. This mechanism is a Pacemaker specific way to temporarily override the normal availability rules without removing the configuration.
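
A rough sketch of these mechanisms with pcs, again using the hypothetical "vip" and "database" resources:

    pcs resource move vip node2      # adds a temporary constraint against the current node
    pcs resource clear vip           # removes that temporary constraint again
    pcs resource disable database    # sets target-role=Stopped
    pcs resource enable database     # sets target-role=Started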

Failure Handling and Fencing Integration

When Pacemaker detects a failure, it must decide what to do next. It uses a combination of retry counts, failure scores, and resource or node properties to choose between local recovery, remote migration, or giving up and stopping the resource entirely.

Each failed operation increments a failure count on the resource for that node. In Pacemaker, failures also contribute to a negative score for that placement. At a certain threshold, the resource may be banned from the failing node, which prevents Pacemaker from retrying on that node and encourages migration to a different node.

Failure counts can be persistent or may expire after a defined period. Pacemaker uses this to distinguish between temporary glitches and persistent issues. For example, you could configure it so that a single failure does not cause a permanent ban, but repeated failures within a short window do.
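
Such a policy is typically expressed through resource meta attributes. As an illustrative sketch, the following would ban the hypothetical "database" resource from a node after three failures there, with the failure count expiring after two minutes:

    pcs resource meta database migration-threshold=3 failure-timeout=120s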

Fencing integration is central in failure handling. If Pacemaker loses contact with a node or determines that it is misbehaving, it may request that the node be fenced. The responsible stonith agent then cuts power or resets the node through mechanisms such as IPMI or a power distribution unit. Only after the fencing is confirmed does Pacemaker proceed with failover actions. This prevents situations where two nodes might simultaneously access shared resources.

Pacemaker also supports node-level failure policies. You can configure it to mark nodes as unclean on certain classes of failures and immediately fence them. This behavior is specific to Pacemaker and differs from simpler scripts that might only restart services. The goal is to preserve data integrity in complex cluster setups.
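
A hedged sketch of configuring a fencing device with pcs is shown below. The device name, address, and credentials are invented, and the exact parameter names depend on the fence agent version in use:

    pcs stonith create fence-node1 fence_ipmilan \
        pcmk_host_list=node1 ip=192.0.2.101 \
        username=admin password=secret lanplus=1
    pcs property set stonith-enabled=true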

Multi-state and Clone Resources

Pacemaker has built-in support for resources that can be active on more than one node at a time. A clone resource is a resource definition from which Pacemaker starts multiple identical instances in parallel on several nodes. Typical examples include distributed file systems or monitoring agents. By defining a clone, you tell Pacemaker to run copies of the resource on some or all nodes, according to clone-specific meta attributes.

A multi-state resource is a special kind of clone where instances can be in different roles, most often master and slave. Pacemaker uses both role-specific constraints and the resource agent's capabilities to ensure that the correct number of masters exists and that promotion or demotion happens safely.

Internally, Pacemaker models multi-state resources with additional constraints, like "only one master exists at any time" or "a master must run where the related storage is mounted." The agent must be designed for this purpose and must implement the promote and demote operations. Pacemaker does not guess how to manage master and slave roles, it relies entirely on the resource agent.
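
As an example sketch, turning existing resources into a clone or a multi-state (promotable) clone with pcs might look like this; "ping-check" and "database" are hypothetical resources whose agents must support the corresponding operations:

    pcs resource clone ping-check        # run an instance on every eligible node
    pcs resource promotable database     # pcs 0.10+; older releases used "pcs resource master"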

Clone and multi-state semantics interact with ordinary colocation and order constraints. For example, colocating a clone with another resource can express that at least one instance of the clone must be present where the other resource runs. These relationships are interpreted by Pacemaker's scheduler according to rules that are specific to the cluster manager and are not part of general Linux service management.

Pacemaker Configuration Interfaces

Although the CIB is stored as XML, administrators rarely edit it directly. Instead, higher level tools are used to manipulate Pacemaker's configuration. Two widely used interfaces are pcs and the crm shell from crmsh. Both provide commands to add nodes, define resources and constraints, and query cluster status. They translate your changes into modifications of the CIB and broadcast those changes to all cluster nodes.

The pcs tool is commonly associated with distributions such as Red Hat Enterprise Linux and its derivatives. It presents a command syntax like pcs cluster and pcs resource and can also manage Corosync configuration. With it you can set up a new cluster, add and remove nodes, and create or delete resources and constraints without dealing with XML.

The crm shell provides an interactive environment and a configuration language that resembles the internal structure of the CIB. It is popular on other distributions, especially those that ship the crmsh package. The shell allows you to enter configuration mode, modify parts of the CIB, and then commit or discard changes.
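
For comparison, the same kind of virtual IP resource used in the earlier pcs examples could be defined in the crm shell roughly like this (names and addresses are again invented):

    crm configure primitive vip ocf:heartbeat:IPaddr2 \
        params ip=192.0.2.10 cidr_netmask=24 \
        op monitor interval=30s
    crm configure show vip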

Both tools support exporting and importing the full cluster configuration. Pacemaker applies versioning information within the CIB, so that every change has a generation counter and timestamp. This enables safe synchronization because a node that rejoins the cluster can detect whether its local CIB copy is outdated and must be replaced.
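
A simple export and import round trip with pcs could look like the following sketch; the file name is arbitrary:

    pcs cluster cib > cluster-backup.xml        # export the current CIB
    pcs cluster cib-push cluster-backup.xml     # push a (possibly edited) copy back to the cluster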

In the background, these tools communicate with the cluster through Pacemaker's administrative interfaces. Authentication and authorization for configuration changes are typically handled outside of Pacemaker itself, for example over TLS by the pcsd daemon that pcs uses, or over SSH in the case of crmsh.

Pacemaker Remote and Guest Nodes

Pacemaker can manage resources not only on traditional cluster nodes but also on remote nodes that do not run the full clustering stack themselves. This feature is known as Pacemaker Remote. It is especially useful for managing virtual machines or containers as part of a cluster.

A remote node is represented within the CIB as a special resource. Pacemaker connects to it over a secure channel through the Pacemaker Remote daemon. Once connected, the cluster treats the remote node like a regular node in terms of scheduling resources. The remote node does not need Corosync or the other Pacemaker daemons. It only needs the remote component and the relevant resource agents.

Guest nodes are a special case of remote nodes where the remote node itself is backed by a resource in the cluster, usually a virtual machine. Pacemaker then manages both the guest's lifecycle and the services inside the guest in a coordinated way. For example, it can ensure that a database service inside a virtual machine only starts after the virtual machine resource is active.
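
With recent pcs versions, remote and guest nodes can be added roughly as follows; the host name, node name, and virtual machine resource are hypothetical:

    pcs cluster node add-remote remote1.example.com   # plain remote node running the Pacemaker Remote daemon
    pcs cluster node add-guest vm-guest1 vm1          # guest node backed by the existing "vm1" resource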

From the Pacemaker perspective, these are simply nodes with a different transport. The same constraints and scoring system apply. The only unique part is the way Pacemaker establishes and maintains the remote connection and how failures of that connection are interpreted. A lost connection to a remote node may trigger recovery of the guest resource itself, not just the services inside it.

Status, Transitions, and Debugging

Pacemaker constantly evaluates cluster state and generates transitions. A transition is a set of actions that, when completed, bring the cluster from the current state to the desired state as computed by the policy engine. Each change in configuration, or each significant state change like a node joining or leaving, triggers a new computation.

You can inspect current cluster status with high level commands. These provide a snapshot listing of nodes, resources, and their states, taken directly from the CIB. Pacemaker also exposes detail views that show which constraints apply to which resources, along with the scores computed for possible placements. This information is essential for understanding why a resource is running on a specific node.
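
Typical commands for this kind of inspection include the following; the score output in particular helps explain placement decisions:

    pcs status --full      # nodes, resources, failures, fencing devices
    crm_mon -1             # one-shot status snapshot
    crm_simulate -L -s     # show the scores computed for the live cluster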

When debugging, it can be helpful to examine the policy engine's transition graphs. These graphs describe the exact sequence of planned operations and the dependencies between them, though they are typically exposed through logs rather than in a graphical form. Logs also record resource agent output and error codes, which help determine whether a problem originates in Pacemaker's decisions or in the underlying service.

Another important debugging feature is the ability to clear failure counts and reset bans. If a resource was moved away from a node after repeated failures and you have fixed the underlying issue, you can clear its failure history so that Pacemaker will consider that node again. This is a Pacemaker specific concept and is separate from clearing operating system logs or restarting services.
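
A short sketch of clearing failure history and bans for the hypothetical "database" resource:

    pcs resource cleanup database                 # reset failure counts and re-probe the resource
    pcs resource clear database                   # remove constraints created by earlier move/ban commands
    crm_resource --cleanup --resource database    # lower-level equivalent of the cleanup above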

Conclusion

Pacemaker is the component in a Linux high availability stack that turns a set of nodes and services into a coordinated cluster. It does this by maintaining a cluster-wide configuration and state in the CIB, using a scoring model with constraints to decide placements, and orchestrating resource agents to start, stop, and monitor services. Its handling of stickiness, failures, fencing, clones, and multi-state resources is specific and often subtle, so understanding these concepts is crucial if you want to build reliable clusters. The surrounding infrastructure provides messaging, storage, and networking, but Pacemaker is where high availability policy is actually enforced.
