Overview of Pacemaker in a Cluster Stack
Pacemaker is a cluster resource manager. In a typical Linux HA stack it sits above:
- The cluster communication/membership layer (usually Corosync)
- The fencing layer (STONITH agents)
- Resource agents (OCF, LSB, systemd, etc.)
Its jobs:
- Decide where resources should run
- Start/stop/move resources according to policy
- React to failures (e.g., restart or failover)
- Enforce constraints (ordering, colocation, etc.)
- Maintain cluster-wide state (cluster information base, or CIB)
You normally do not use Pacemaker alone: you combine it with Corosync and fencing to get a complete HA cluster.
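As a sketch of how these layers come together, a minimal two-node cluster can be bootstrapped with `pcs` roughly as follows. Node names and the cluster name are placeholders, and the exact syntax varies between `pcs` major versions (older releases use `pcs cluster auth` and `pcs cluster setup --name`):

```shell
# Authenticate pcsd between the (placeholder) nodes, create the cluster,
# and start Corosync + Pacemaker everywhere.
pcs host auth node1 node2
pcs cluster setup mycluster node1 node2
pcs cluster start --all
pcs status          # verify membership and which node is the DC
```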
Core Pacemaker Concepts
Cluster Information Base (CIB)
The CIB is a cluster-wide XML configuration + state database. It contains:
- Resource definitions (what exists)
- Constraints (how they relate)
- Node attributes and status
- Operation history (success/failure of actions)
You manipulate the CIB with tools like pcs or crm rather than editing XML directly on modern setups.
Under the hood, CIB changes are versioned and replicated to all nodes; the Designated Controller (DC) node coordinates this.
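For inspection (as opposed to editing), a few read-only views of the CIB are commonly used:

```shell
# Dump the full CIB (configuration + status sections) as XML.
pcs cluster cib > cib.xml

# Query a single CIB section with the low-level tool.
cibadmin --query --scope resources

# One-shot, human-readable rendering of current cluster state.
crm_mon -1
```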
Designated Controller (DC)
At any time, one node is the DC:
- Runs the scheduler
- Calculates “desired cluster state”
- Issues start/stop/move operations to local LRMs (local resource managers)
If the DC fails, another node takes over automatically.
Resources and Resource Agents
Pacemaker manages resources via resource agents (RAs). Key points:
- RAs implement standard actions: `start`, `stop`, `monitor`, and sometimes `promote`/`demote`.
- Common RA standards: OCF, LSB, systemd, service.
- Resources are typed, e.g. `ocf:heartbeat:IPaddr2`, `ocf:heartbeat:Filesystem`, `ocf:heartbeat:pgsql`.
Pacemaker itself doesn’t know how to run PostgreSQL or an IP address; it just calls the agent with parameters you configure.
Resource Classes and Types
When defining a resource you specify:
- `class` – e.g. `ocf`, `systemd`, `service`
- `provider` – often `heartbeat` or `pacemaker` for OCF
- `type` – the specific agent name, e.g. `IPaddr2`
Example conceptually (not focusing on specific front-ends yet):
- Class: `ocf`
- Provider: `heartbeat`
- Type: `IPaddr2`
- Parameters: IP address, netmask, interface, etc.
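With `pcs`, that conceptual class/provider/type triple maps directly onto a resource definition. The resource name, IP, and monitor interval below are illustrative values, not recommendations:

```shell
# Virtual IP resource: class "ocf", provider "heartbeat", type "IPaddr2".
pcs resource create vip ocf:heartbeat:IPaddr2 \
    ip=192.0.2.10 cidr_netmask=24 \
    op monitor interval=30s
```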
Resource Stickiness and Migration Threshold
Pacemaker uses scores to decide where to place resources. Two important notions:
- Resource stickiness: score that expresses how strongly a resource prefers to stay where it is. Higher stickiness = less “flapping” between nodes.
- Migration threshold: how many failures on a node are tolerated before Pacemaker prefers to move the resource elsewhere.
These control how aggressively resources are moved after failures or topology changes.
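Both knobs are set as meta-attributes. The values and the resource name (`vip`) below are placeholders, and the `resource defaults` syntax differs slightly across `pcs` versions:

```shell
# Cluster-wide default: resources prefer to stay where they are.
pcs resource defaults resource-stickiness=100

# Per-resource: after 3 failures on a node, avoid that node.
pcs resource meta vip migration-threshold=3
```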
Fencing and STONITH in Pacemaker
Fencing is essential in HA clusters. Pacemaker integrates with STONITH agents:
- Pacemaker calls STONITH resources to power off / reset misbehaving nodes.
- Only after fencing is confirmed will Pacemaker bring resources online elsewhere if needed.
- Fencing decisions are influenced by failure counts, constraints, and node roles.
You configure fencing devices as Pacemaker resources like any other, but they are treated specially by the cluster engine.
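A fencing device definition therefore looks much like any other resource. The sketch below assumes an IPMI-capable BMC; the host list, address, and credentials are placeholders, and parameter names differ between fence-agent versions:

```shell
# One fence device per node is a common pattern for IPMI fencing.
pcs stonith create fence-node1 fence_ipmilan \
    pcmk_host_list=node1 ip=198.51.100.1 \
    username=admin password=secret lanplus=1 \
    op monitor interval=60s

# Fencing must be enabled for Pacemaker to act on it.
pcs property set stonith-enabled=true
```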
Tools and Interfaces
`pcs` vs `crm` vs Low-Level Tools
Pacemaker itself exposes core daemons and XML CIB, but you normally use higher-level tools:
- `pcs` (Pacemaker/Corosync configuration system)
  - Widely used on RHEL, CentOS, Rocky, Alma, etc.
  - Manages both Corosync and Pacemaker config.
- `crm`/`crmsh`
  - Common in SUSE environments.
- Older or more direct tools (`cibadmin`, `crm_mon`, `crm_resource`, `crm_node`) still exist and are useful for troubleshooting.
Whichever front-end you use, all are ultimately manipulating the same CIB and resource definitions.
High-Level vs Low-Level Configuration
Pacemaker supports two broad configuration approaches:
- High-level constructs:
  - Resource groups
  - Clones
  - Multi-state (master/slave) resources
- Low-level constraints:
  - Location
  - Colocation
  - Ordering
  - Ticket-based (for geo-clusters)
Often you will mix both: use a group for a simple stack (VIP + filesystem + service) and then add specific constraints as needed.
Resource Types in Pacemaker
Primitive Resources
A primitive resource is the basic building block:
- Represents one service instance
- Defined with a type, class, provider, and parameters
- Can have operations defined (monitor intervals, timeouts, etc.)
Examples (conceptually): a database instance, a virtual IP, a filesystem mount.
Resource Groups
Groups are ordered, colocated sets of resources that behave as one unit:
- Start order is defined by group order (first in group starts first).
- Stop order is reverse of start order.
- All members normally reside on the same node.
Use groups for stacks like:
- IP address → filesystem → application daemon
- Multiple dependent daemons that should always live together
Groups simplify configuration by avoiding explicit ordering and colocation constraints between every pair.
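Assuming three already-defined primitives (the names `vip`, `fs`, and `app` are placeholders), a group is a single command; members start in the listed order and stop in reverse:

```shell
# vip starts first, then fs, then app; all three stay on one node.
pcs resource group add webstack vip fs app
```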
Clones
Clones run the same primitive on multiple nodes simultaneously, e.g.:
- A distributed filesystem client
- A monitoring agent
- A read-only service that can run on all nodes
Important concepts:
- `clone-max`: maximum number of clone instances across the whole cluster
- `clone-node-max`: maximum number of instances per node
You still may apply constraints to clones (e.g. certain clones should avoid specific nodes).
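Cloning an existing primitive (the name `agent` is a placeholder) is a one-liner; the exact placement of the meta options varies slightly by `pcs` version:

```shell
# Run up to two instances cluster-wide, at most one per node.
pcs resource clone agent clone-max=2 clone-node-max=1
```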
Multi-State (Master/Slave) Resources
Multi-state resources support master and slave roles:
- For replicated services like DRBD, some databases, etc.
- Pacemaker manages role transitions via `promote`/`demote` operations.
- Constraints control which nodes can be masters and how many masters are allowed.
You can express policies like:
- “Exactly one master at any time”
- “Prefer node A as master, but allow node B if A fails”
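Newer Pacemaker/`pcs` releases call these "promotable" clones (older releases use `pcs resource master`). A DRBD-style sketch, with `drbd-data` as a placeholder primitive, expresses "exactly one master" like this:

```shell
# At most one promoted (master) instance cluster-wide.
pcs resource promotable drbd-data promoted-max=1 promoted-node-max=1
```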
Constraints and Scheduling
Pacemaker uses a scoring-based scheduler; constraints modify scores or enforce ordering. The three most used categories:
Location Constraints
Location constraints influence where a resource is allowed or preferred to run:
- Prefer or ban specific nodes
- Use scores to express preference:
  - Positive score: prefer
  - Negative score: avoid
  - `-INFINITY`: never run here
- Can be static or conditional (based on node attributes)
Use cases:
- Keep database primarily on nodes with SSDs
- Avoid a node with limited CPU/RAM for heavy services
- Implement maintenance windows (temporarily ban resources from a node)
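In `pcs`, soft preferences and hard bans look like this (resource and node names are placeholders):

```shell
# Soft preference: score 100 for node1.
pcs constraint location db prefers node1=100

# Hard ban: -INFINITY, db will never run on node3.
pcs constraint location db avoids node3
```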
Colocation Constraints
Colocation constraints express togetherness or separation between resources:
- “Run resource A on the same node as resource B”
- “Do not run A on any node where B runs”
- Support scoring to express soft vs hard colocation
Examples:
- VIP must run on same node as the application daemon
- Filesystem must be on same node as database
- Two heavy services should not be colocated
When using groups, basic colocation is handled implicitly within the group, but you still use colocation for interactions between groups or between primitives and clones.
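A mandatory colocation uses `INFINITY`; a negative score keeps resources apart. Resource names below are placeholders:

```shell
# vip must run on whichever node runs app.
pcs constraint colocation add vip with app INFINITY

# batch must never share a node with db.
pcs constraint colocation add batch with db -INFINITY
```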
Ordering Constraints
Ordering constraints define start/stop sequence:
- “Start A before B”
- “Stop B before A” (stop order is usually inferred but can be refined)
- Can be mandatory or advisory (with scores)
Common patterns:
- Filesystem must start before database
- Database must start before application
- In failover, higher-level services stop before lower-level ones
Ordering and colocation together let you build predictable service stacks across nodes.
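The filesystem → database → application chain above translates into two ordering constraints (names are placeholders); the matching stop order is inferred in reverse:

```shell
pcs constraint order start fs then start db
pcs constraint order start db then start app
```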
Scores and Decision-Making
Pacemaker’s scheduler calculates a final score per resource per node, combining:
- Base preferences (stickiness)
- Location constraints
- Failures and migration thresholds
- Node attributes
- Colocation effects (a colocated resource inherits or modifies scores)
If a resource has equal best scores on multiple nodes, Pacemaker may choose based on tie-breakers (e.g., lexicographic node names) unless constrained otherwise.
Operations and Timeouts
Operations: Start, Stop, Monitor, Promote, Demote
Each resource defines operations with parameters:
- `start`: how to start, with timeout
- `stop`: how to stop, with timeout
- `monitor`: periodic health checks (interval, timeout)
- `promote`/`demote`: for multi-state resources
You can have multiple monitors:
- One frequent, lightweight monitor when resource is running
- Another less frequent, deeper check
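A sketch of that two-monitor setup on a placeholder resource; the intervals and timeouts are illustrative, and `OCF_CHECK_LEVEL` is only honoured by agents that implement check depths:

```shell
pcs resource update db \
    op start timeout=60s \
    op stop timeout=120s \
    op monitor interval=15s timeout=30s \
    op monitor interval=120s timeout=60s OCF_CHECK_LEVEL=10
```

Note that multiple monitors on one resource must use distinct intervals.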
Timeouts and Failure Semantics
Timeouts must be realistic:
- Too short → Pacemaker incorrectly thinks the resource failed, may fence a node.
- Too long → Failover is slow; clients experience longer downtime.
When an operation fails:
- Pacemaker increments a failure count for that resource on that node.
- If count exceeds migration threshold, scheduler will avoid that node.
- You can define failure policies (ignore, restart, migrate, fence, etc.) depending on severity.
Resource Meta-Attributes
In addition to RA parameters, Pacemaker supports meta-attributes that affect scheduling:
Common examples:
- `resource-stickiness`
- `migration-threshold`
- `priority` (affects which resources are stopped first in low-resource situations)
- `target-role` (e.g., `Started`, `Stopped`, `Master` for multi-state)
- `is-managed` (temporarily lets you manage a resource manually)
These are separate from RA parameters like IP addresses, ports, or paths.
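Meta-attributes are set with `pcs resource meta` (the resource name `db` is a placeholder); they change how the scheduler treats the resource, not how the agent behaves:

```shell
pcs resource meta db resource-stickiness=200 target-role=Started
pcs resource meta db is-managed=false   # hands-off while you intervene manually
```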
Node Management and Cluster Behavior
Node States
Pacemaker tracks node states such as:
- `online`/`offline`
- `standby` – node is healthy but not allowed to run resources
- `maintenance` – cluster ignores resource failures on this node
- `unclean` – node lost contact and must be fenced before continuing
You can manually put nodes in standby or maintenance mode for safe maintenance.
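Draining a node for maintenance and bringing it back (node name is a placeholder; older `pcs` releases use `pcs cluster standby` instead of `pcs node standby`):

```shell
pcs node standby node2     # resources migrate off node2
pcs node unstandby node2   # node2 may host resources again
```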
Quorum and Two-Node Quirks
Pacemaker is quorum-aware:
- In clusters with $N$ nodes, a majority (more than $N/2$) is usually required to keep resources running.
- Losing quorum typically causes Pacemaker to stop resources to avoid split-brain.
Special attention is needed for:
- Two-node clusters:
  - Often require special Corosync settings or witness devices.
  - Fencing is critical; Pacemaker must be sure which node is “alive” and allowed to host resources.
- Geo-clusters:
  - May use site-level tickets to decide which site is active.
Pacemaker integrates with the cluster layer’s quorum subsystem; policies like “no-quorum-policy” can be tuned.
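Two of the relevant knobs, as a sketch; `ignore` is only sane in a two-node cluster with reliable fencing, and two-node Corosync setups typically also set `two_node: 1` in `corosync.conf`:

```shell
# What Pacemaker does without quorum: stop (default), ignore, freeze, suicide.
pcs property set no-quorum-policy=stop

# Common two-node/votequorum safety options.
pcs quorum update wait_for_all=1
```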
Fencing and Recovery Workflow
A typical Pacemaker reaction to a severe failure:
- Detect failure (via monitor timeout or Corosync membership changes).
- Mark node unclean if it disappears.
- Invoke a fence device (STONITH) to power-off/reset the node.
- Wait for fencing confirmation.
- Recalculate placement and start resources elsewhere, respecting constraints.
Misconfigured fencing can lead to:
- Stalemate (no node can be fenced, so cluster refuses to start resources).
- “Shoot the wrong node” scenarios, which you avoid with careful device configuration and testing.
Typical Pacemaker Use Patterns
Highly-Available IP + Service
Simplest pattern:
- Primitive: IP address resource
- Primitive: service (e.g. HTTP daemon)
- Group: IP + service (start IP first, then service; both colocated)
- Location constraint: prefer node A, allow node B as failover
If node A fails, Pacemaker moves the group to node B; client connections keep using the same IP.
Active/Passive Database with Storage
Common pattern:
- Primitive: shared filesystem or replicated volume (DRBD, etc.)
- Primitive: database service
- Optionally a VIP
- Group or combination of ordering + colocation
- Multi-state resource for DRBD:
- Only allow database to run where DRBD is Master.
- Order promotion of DRBD before starting DB.
Pacemaker ensures storage role and service placement are consistent.
Distributed Service via Clones
Pattern for cluster-wide agents/services:
- Clone of a primitive: e.g. monitoring agent
- `clone-max` equal to the number of nodes
- Optional location constraints if some nodes should not run clones
Pacemaker gradually starts/stops clone instances across nodes, respecting overall cluster health.
Monitoring and Troubleshooting Pacemaker
Status and Cluster View
Useful concepts:
- Cluster-wide status vs per-node status.
- Resource states: `Started`, `Stopped`, `Master`, `Slave`, `Failed`.
- Historical operations: when and why a resource was restarted or moved.
You can observe:
- Which node is DC
- Quorum status
- Fencing history
- Failure counts per resource/node
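All of the above are visible from the command line; output formats and some subcommands vary by version (`pcs stonith history` exists only in newer `pcs`), and `db` is a placeholder resource name:

```shell
pcs status --full                # DC, quorum, resources, recent failures
pcs stonith history              # recent fencing actions (newer pcs)
crm_mon -1                       # low-level one-shot cluster view
pcs resource failcount show db   # failure counts per node for one resource
```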
Common Misconfigurations
Typical issues specific to Pacemaker:
- Missing or incorrect fencing:
  - Cluster refuses to start (“no STONITH, no resources”) depending on policy.
  - Or risk of split-brain.
- Unrealistic timeouts:
  - Frequent false failure detection and unnecessary failovers.
- Overly strict constraints:
  - No node satisfies all location/colocation rules, so resources remain stopped.
- Ignoring failure counts:
  - Resource repeatedly started on a bad node because migration thresholds are not tuned.
Diagnosing often involves:
- Inspecting the CIB
- Reviewing operation history
- Checking fencing logs
- Verifying resource agents work correctly when run manually
Design Considerations for Pacemaker-Based Clusters
When designing with Pacemaker:
- Keep the cluster simple where possible:
  - Fewer resources and constraints = easier reasoning.
- Group related services:
  - Use groups and sensible ordering instead of many separate primitives and constraints.
- Model failure modes:
  - Decide when to restart locally vs move resources vs fence nodes.
- Test fencing and failover scenarios:
  - Simulate node crashes, network losses, and resource failures.
- Avoid unnecessary flapping:
  - Configure adequate stickiness and realistic timeouts.
- Consider split-site / geo patterns carefully:
  - Use tickets, external arbitration, or carefully designed quorum setup.
Pacemaker gives you a powerful policy engine; your main task as an administrator is to express the correct policies (constraints, meta-attributes, fencing strategy) for your environment.