Overview of Pacemaker in a Cluster Stack
Pacemaker is a cluster resource manager. In a typical Linux HA stack it sits above:
- The cluster communication/membership layer (usually Corosync)
- The fencing layer (STONITH agents)
- Resource agents (OCF, LSB, systemd, etc.)
Its jobs:
- Decide where resources should run
- Start/stop/move resources according to policy
- React to failures (e.g., restart or failover)
- Enforce constraints (ordering, colocation, etc.)
- Maintain cluster-wide state (cluster information base, or CIB)
You normally do not use Pacemaker alone: you combine it with Corosync and fencing to get a complete HA cluster.
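As a sketch of how these layers come together, a minimal two-node cluster can be bootstrapped with `pcs` roughly as follows. Node names and the cluster name are placeholders, and the exact syntax varies between `pcs` major versions (older releases use `pcs cluster auth` and `pcs cluster setup --name`):

```shell
# Authenticate pcsd between the (placeholder) nodes, create the cluster,
# and start Corosync + Pacemaker everywhere.
pcs host auth node1 node2
pcs cluster setup mycluster node1 node2
pcs cluster start --all
pcs status          # verify membership and which node is the DC
```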
Core Pacemaker Concepts
Cluster Information Base (CIB)
The CIB is a cluster-wide XML configuration + state database. It contains:
- Resource definitions (what exists)
- Constraints (how they relate)
- Node attributes and status
- Operation history (success/failure of actions)
You manipulate the CIB with tools like pcs or crm rather than editing XML directly on modern setups.
Under the hood, CIB changes are versioned and replicated to all nodes; the Designated Controller (DC) node coordinates this.
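For inspection (as opposed to editing), a few read-only views of the CIB are commonly used:

```shell
# Dump the full CIB (configuration + status sections) as XML.
pcs cluster cib > cib.xml

# Query a single CIB section with the low-level tool.
cibadmin --query --scope resources

# One-shot, human-readable rendering of current cluster state.
crm_mon -1
```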
Designated Controller (DC)
At any time, one node is the DC:
- Runs the scheduler
- Calculates “desired cluster state”
- Issues start/stop/move operations to local LRMs (local resource managers)
If the DC fails, another node takes over automatically.
Resources and Resource Agents
Pacemaker manages resources via resource agents (RAs). Key points:
- RAs implement standard actions: `start`, `stop`, `monitor`, and sometimes `promote`/`demote`.
- Common RA standards: OCF, LSB, systemd, service.
- Resources are typed, e.g. `ocf:heartbeat:IPaddr2`, `ocf:heartbeat:Filesystem`, `ocf:heartbeat:pgsql`.
Pacemaker itself doesn’t know how to run PostgreSQL or an IP address; it just calls the agent with parameters you configure.
Resource Classes and Types
When defining a resource you specify:
- `class` – e.g. `ocf`, `systemd`, `service`
- `provider` – often `heartbeat` or `pacemaker` for OCF
- `type` – the specific agent name, e.g. `IPaddr2`
Example conceptually (not focusing on specific front-ends yet):
- Class: `ocf`
- Provider: `heartbeat`
- Type: `IPaddr2`
- Parameters: IP address, netmask, interface, etc.
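With `pcs`, that conceptual class/provider/type triple maps directly onto a resource definition. The resource name, IP, and monitor interval below are illustrative values, not recommendations:

```shell
# Virtual IP resource: class "ocf", provider "heartbeat", type "IPaddr2".
pcs resource create vip ocf:heartbeat:IPaddr2 \
    ip=192.0.2.10 cidr_netmask=24 \
    op monitor interval=30s
```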
Resource Stickiness and Migration Threshold
Pacemaker uses scores to decide where to place resources. Two important notions:
- Resource stickiness: score that expresses how strongly a resource prefers to stay where it is. Higher stickiness = less “flapping” between nodes.
- Migration threshold: how many failures on a node are tolerated before Pacemaker prefers to move the resource elsewhere.
These control how aggressively resources are moved after failures or topology changes.
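Both knobs are set as meta-attributes. The values and the resource name (`vip`) below are placeholders, and the `resource defaults` syntax differs slightly across `pcs` versions:

```shell
# Cluster-wide default: resources prefer to stay where they are.
pcs resource defaults resource-stickiness=100

# Per-resource: after 3 failures on a node, avoid that node.
pcs resource meta vip migration-threshold=3
```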
Fencing and STONITH in Pacemaker
Fencing is essential in HA clusters. Pacemaker integrates with STONITH agents:
- Pacemaker calls STONITH resources to power off / reset misbehaving nodes.
- Only after fencing is confirmed will Pacemaker bring resources online elsewhere if needed.
- Fencing decisions are influenced by failure counts, constraints, and node roles.
You configure fencing devices as Pacemaker resources like any other, but they are treated specially by the cluster engine.
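A fencing device definition therefore looks much like any other resource. The sketch below assumes an IPMI-capable BMC; the host list, address, and credentials are placeholders, and parameter names differ between fence-agent versions:

```shell
# One fence device per node is a common pattern for IPMI fencing.
pcs stonith create fence-node1 fence_ipmilan \
    pcmk_host_list=node1 ip=198.51.100.1 \
    username=admin password=secret lanplus=1 \
    op monitor interval=60s

# Fencing must be enabled for Pacemaker to act on it.
pcs property set stonith-enabled=true
```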
Tools and Interfaces
`pcs` vs `crm` vs Low-Level Tools
Pacemaker itself exposes core daemons and XML CIB, but you normally use higher-level tools:
- `pcs` (Pacemaker/Corosync configuration system)
  - Widely used on RHEL, CentOS, Rocky, Alma, etc.
  - Manages both Corosync and Pacemaker config.
- `crm`/`crmsh`
  - Common in SUSE environments.
- Older or more direct tools (`cibadmin`, `crm_mon`, `crm_resource`, `crm_node`) still exist and are useful for troubleshooting.
Whichever front-end you use, all are ultimately manipulating the same CIB and resource definitions.
High-Level vs Low-Level Configuration
Pacemaker supports two broad configuration approaches:
- High-level constructs:
  - Resource groups
  - Clones
  - Multi-state (master/slave) resources
- Low-level constraints:
  - Location
  - Colocation
  - Ordering
  - Ticket-based (for geo-clusters)
Often you will mix both: use a group for a simple stack (VIP + filesystem + service) and then add specific constraints as needed.
Resource Types in Pacemaker
Primitive Resources
A primitive resource is the basic building block:
- Represents one service instance
- Defined with a type, class, provider, and parameters
- Can have operations defined (monitor intervals, timeouts, etc.)
Examples (conceptually): a database instance, a virtual IP, a filesystem mount.
Resource Groups
Groups are ordered, colocated sets of resources that behave as one unit:
- Start order is defined by group order (first in group starts first).
- Stop order is reverse of start order.
- All members normally reside on the same node.
Use groups for stacks like:
- IP address → filesystem → application daemon
- Multiple dependent daemons that should always live together
Groups simplify configuration by avoiding explicit ordering and colocation constraints between every pair.
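Assuming three already-defined primitives (the names `vip`, `fs`, and `app` are placeholders), a group is a single command; members start in the listed order and stop in reverse:

```shell
# vip starts first, then fs, then app; all three stay on one node.
pcs resource group add webstack vip fs app
```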
Clones
Clones run the same primitive on multiple nodes simultaneously, e.g.:
- A distributed filesystem client
- A monitoring agent
- A read-only service that can run on all nodes
Important concepts:
- `clone-max`: maximum number of clone instances across the whole cluster
- `clone-node-max`: maximum number of instances per node
You still may apply constraints to clones (e.g. certain clones should avoid specific nodes).
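Cloning an existing primitive (the name `agent` is a placeholder) is a one-liner; the exact placement of the meta options varies slightly by `pcs` version:

```shell
# Run up to two instances cluster-wide, at most one per node.
pcs resource clone agent clone-max=2 clone-node-max=1
```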
Multi-State (Master/Slave) Resources
Multi-state resources support master and slave roles:
- For replicated services like DRBD, some databases, etc.
- Pacemaker manages role transitions via `promote`/`demote` operations.
- Constraints control which nodes can be masters and how many masters are allowed.
You can express policies like:
- “Exactly one master at any time”
- “Prefer node A as master, but allow node B if A fails”
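Newer Pacemaker/`pcs` releases call these "promotable" clones (older releases use `pcs resource master`). A DRBD-style sketch, with `drbd-data` as a placeholder primitive, expresses "exactly one master" like this:

```shell
# At most one promoted (master) instance cluster-wide.
pcs resource promotable drbd-data promoted-max=1 promoted-node-max=1
```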
Constraints and Scheduling
Pacemaker uses a scoring-based scheduler; constraints modify scores or enforce ordering. The three most used categories:
Location Constraints
Location constraints influence where a resource is allowed or preferred to run:
- Prefer or ban specific nodes
- Use scores to express preference:
  - Positive score: prefer
  - Negative score: avoid
  - `-INFINITY`: never run here
- Can be static or conditional (based on node attributes)
Use cases:
- Keep database primarily on nodes with SSDs
- Avoid a node with limited CPU/RAM for heavy services
- Implement maintenance windows (temporarily ban resources from a node)
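In `pcs`, soft preferences and hard bans look like this (resource and node names are placeholders):

```shell
# Soft preference: score 100 for node1.
pcs constraint location db prefers node1=100

# Hard ban: -INFINITY, db will never run on node3.
pcs constraint location db avoids node3
```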
Colocation Constraints
Colocation constraints express togetherness or separation between resources:
- “Run resource A on the same node as resource B”
- “Do not run A on any node where B runs”
- Support scoring to express soft vs hard colocation
Examples:
- VIP must run on same node as the application daemon
- Filesystem must be on same node as database
- Two heavy services should not be colocated
When using groups, basic colocation is handled implicitly within the group, but you still use colocation for interactions between groups or between primitives and clones.
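A mandatory colocation uses `INFINITY`; a negative score keeps resources apart. Resource names below are placeholders:

```shell
# vip must run on whichever node runs app.
pcs constraint colocation add vip with app INFINITY

# batch must never share a node with db.
pcs constraint colocation add batch with db -INFINITY
```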
Ordering Constraints
Ordering constraints define start/stop sequence:
- “Start A before B”
- “Stop B before A” (stop order is usually inferred but can be refined)
- Can be mandatory or advisory (with scores)
Common patterns:
- Filesystem must start before database
- Database must start before application
- In failover, higher-level services stop before lower-level ones
Ordering and colocation together let you build predictable service stacks across nodes.
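The filesystem → database → application chain above translates into two ordering constraints (names are placeholders); the matching stop order is inferred in reverse:

```shell
pcs constraint order start fs then start db
pcs constraint order start db then start app
```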
Scores and Decision-Making
Pacemaker’s scheduler calculates a final score per resource per node, combining:
- Base preferences (stickiness)
- Location constraints
- Failures and migration thresholds
- Node attributes
- Colocation effects (a colocated resource inherits or modifies scores)
If a resource has equal best scores on multiple nodes, Pacemaker may choose based on tie-breakers (e.g., lexicographic node names) unless constrained otherwise.
Operations and Timeouts
Operations: Start, Stop, Monitor, Promote, Demote
Each resource defines operations with parameters:
- `start`: how to start, with timeout
- `stop`: how to stop, with timeout
- `monitor`: periodic health checks (interval, timeout)
- `promote`/`demote`: for multi-state resources
You can have multiple monitors:
- One frequent, lightweight monitor when resource is running
- Another less frequent, deeper check
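A sketch of that two-monitor setup on a placeholder resource; the intervals and timeouts are illustrative, and `OCF_CHECK_LEVEL` is only honoured by agents that implement check depths:

```shell
pcs resource update db \
    op start timeout=60s \
    op stop timeout=120s \
    op monitor interval=15s timeout=30s \
    op monitor interval=120s timeout=60s OCF_CHECK_LEVEL=10
```

Note that multiple monitors on one resource must use distinct intervals.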
Timeouts and Failure Semantics
Timeouts must be realistic:
- Too short → Pacemaker incorrectly thinks the resource failed, may fence a node.
- Too long → Failover is slow; clients experience longer downtime.
When an operation fails:
- Pacemaker increments a failure count for that resource on that node.
- If count exceeds migration threshold, scheduler will avoid that node.
- You can define failure policies (ignore, restart, migrate, fence, etc.) depending on severity.
Resource Meta-Attributes
In addition to RA parameters, Pacemaker supports meta-attributes that affect scheduling:
Common examples:
- `resource-stickiness`
- `migration-threshold`
- `priority` (affects which resources are stopped first in low-resource situations)
- `target-role` (e.g., `Started`, `Stopped`, `Master` for multi-state)
- `is-managed` (temporarily lets you manage a resource manually)
These are separate from RA parameters like IP addresses, ports, or paths.
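Meta-attributes are set with `pcs resource meta` (the resource name `db` is a placeholder); they change how the scheduler treats the resource, not how the agent behaves:

```shell
pcs resource meta db resource-stickiness=200 target-role=Started
pcs resource meta db is-managed=false   # hands-off while you intervene manually
```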
Node Management and Cluster Behavior
Node States
Pacemaker tracks node states such as:
- `online`/`offline`
- `standby` – node is healthy but not allowed to run resources
- `maintenance` – cluster ignores resource failures on this node
- `unclean` – node lost contact and must be fenced before continuing
You can manually put nodes in standby or maintenance mode for safe maintenance.
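Draining a node for maintenance and bringing it back (node name is a placeholder; older `pcs` releases use `pcs cluster standby` instead of `pcs node standby`):

```shell
pcs node standby node2     # resources migrate off node2
pcs node unstandby node2   # node2 may host resources again
```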
Quorum and Two-Node Quirks
Pacemaker is quorum-aware:
- In clusters with $N$ nodes, a majority (more than $N/2$) is usually required to keep resources running.
- Losing quorum typically causes Pacemaker to stop resources to avoid split-brain.
Special attention is needed for:
- Two-node clusters:
  - Often require special Corosync settings or witness devices.
  - Fencing is critical; Pacemaker must be sure which node is “alive” and allowed to host resources.
- Geo-clusters:
  - May use site-level tickets to decide which site is active.
Pacemaker integrates with the cluster layer’s quorum subsystem; policies like “no-quorum-policy” can be tuned.
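Two of the relevant knobs, as a sketch; `ignore` is only sane in a two-node cluster with reliable fencing, and two-node Corosync setups typically also set `two_node: 1` in `corosync.conf`:

```shell
# What Pacemaker does without quorum: stop (default), ignore, freeze, suicide.
pcs property set no-quorum-policy=stop

# Common two-node/votequorum safety options.
pcs quorum update wait_for_all=1
```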
Fencing and Recovery Workflow
A typical Pacemaker reaction to a severe failure:
- Detect failure (via monitor timeout or Corosync membership changes).
- Mark node unclean if it disappears.
- Invoke a fence device (STONITH) to power-off/reset the node.
- Wait for fencing confirmation.
- Recalculate placement and start resources elsewhere, respecting constraints.
Misconfigured fencing can lead to:
- Stalemate (no node can be fenced, so cluster refuses to start resources).
- “Shoot the wrong node” scenarios, which you avoid with careful device configuration and testing.
Typical Pacemaker Use Patterns
Highly-Available IP + Service
Simplest pattern:
- Primitive: IP address resource
- Primitive: service (e.g. HTTP daemon)
- Group: IP + service (start IP first, then service; both colocated)
- Location constraint: prefer node A, allow node B as failover
If node A fails, Pacemaker moves the group to node B; client connections keep using the same IP.
Active/Passive Database with Storage
Common pattern:
- Primitive: shared filesystem or replicated volume (DRBD, etc.)
- Primitive: database service
- Optionally a VIP
- Group or combination of ordering + colocation
- Multi-state resource for DRBD:
- Only allow database to run where DRBD is Master.
- Order promotion of DRBD before starting DB.
Pacemaker ensures storage role and service placement are consistent.
Distributed Service via Clones
Pattern for cluster-wide agents/services:
- Clone of a primitive: e.g. monitoring agent
- `clone-max` equal to the number of nodes
- Optional location constraints if some nodes should not run clones
Pacemaker gradually starts/stops clone instances across nodes, respecting overall cluster health.
Monitoring and Troubleshooting Pacemaker
Status and Cluster View
Useful concepts:
- Cluster-wide status vs per-node status.
- Resource states: `Started`, `Stopped`, `Master`, `Slave`, `Failed`.
- Historical operations: when and why a resource was restarted or moved.
You can observe:
- Which node is DC
- Quorum status
- Fencing history
- Failure counts per resource/node
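All of the above are visible from the command line; output formats and some subcommands vary by version (`pcs stonith history` exists only in newer `pcs`), and `db` is a placeholder resource name:

```shell
pcs status --full                # DC, quorum, resources, recent failures
pcs stonith history              # recent fencing actions (newer pcs)
crm_mon -1                       # low-level one-shot cluster view
pcs resource failcount show db   # failure counts per node for one resource
```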
Common Misconfigurations
Typical issues specific to Pacemaker:
- Missing or incorrect fencing:
  - Cluster refuses to start (“no STONITH, no resources”) depending on policy.
  - Or risk of split-brain.
- Unrealistic timeouts:
  - Frequent false failure detection and unnecessary failovers.
- Overly strict constraints:
  - No node satisfies all location/colocation rules, so resources remain stopped.
- Ignoring failure counts:
  - Resource repeatedly started on a bad node because migration thresholds are not tuned.
Diagnosing often involves:
- Inspecting the CIB
- Reviewing operation history
- Checking fencing logs
- Verifying resource agents work correctly when run manually
Design Considerations for Pacemaker-Based Clusters
When designing with Pacemaker:
- Keep the cluster simple where possible:
  - Fewer resources and constraints = easier reasoning.
- Group related services:
  - Use groups and sensible ordering instead of many separate primitives and constraints.
- Model failure modes:
  - Decide when to restart locally vs move resources vs fence nodes.
- Test fencing and failover scenarios:
  - Simulate node crashes, network losses, and resource failures.
- Avoid unnecessary flapping:
  - Configure adequate stickiness and realistic timeouts.
- Consider split-site / geo patterns carefully:
  - Use tickets, external arbitration, or carefully designed quorum setup.
Pacemaker gives you a powerful policy engine; your main task as an administrator is to express the correct policies (constraints, meta-attributes, fencing strategy) for your environment.