
5.6.4 Cluster resource management

Understanding Cluster Resource Management

Cluster resource management is the discipline of deciding what runs where in a cluster, keeping it running when things fail, and moving it when circumstances change. In the context of high availability on Linux, it is largely about how a cluster resource manager, often Pacemaker with Corosync as the messaging layer, monitors and controls services, filesystems, IP addresses, and other resources across multiple nodes.

The focus here is not on how to build the cluster itself, nor on the low level messaging, but on how resources are modeled, started, stopped, monitored, and constrained so that applications remain available without manual intervention.

Resources, Actions, and Resource Agents

In a cluster, almost everything that matters to your application is modeled as a resource. A resource can be a virtual IP address, a filesystem mounted from shared storage, a database instance, a web server process, a load balancer, or even an arbitrary script.

The cluster does not hard code the logic to start or stop any specific application. Instead, it calls resource agents. A resource agent is an executable program that knows how to perform actions such as start, stop, monitor, and sometimes promote or demote for a specific type of resource. On Linux with Pacemaker, resource agents typically follow the OCF (Open Cluster Framework) standard or use legacy LSB or systemd interfaces.

The resource manager tracks the state of each resource by running monitor actions at defined intervals. If a monitor action reports that a resource is failed or not running where it should be, the resource manager can attempt recovery by issuing a stop then start on an appropriate node. If recovery fails or misbehavior is detected, the node may be fenced to protect data integrity.

A resource is therefore defined by its type, its parameters, and the actions that should run at particular intervals. For example, a filesystem resource will have parameters such as the device, mount point, and filesystem type, while a database resource might have connection details and data directory paths.

A resource agent's actions must be idempotent and deterministic. Running the same start or stop action multiple times must not corrupt data or leave the service in an unknown state.
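As a concrete sketch, a shared filesystem resource might be defined like this with the `pcs` command line tool. The resource name `shared_fs`, the device, and the mount point are placeholders; adjust them to your environment.

```shell
# Define a filesystem resource with its device, mount point, and type,
# plus a recurring monitor action (names and paths are examples).
pcs resource create shared_fs ocf:heartbeat:Filesystem \
    device=/dev/sdb1 directory=/srv/data fstype=ext4 \
    op monitor interval=20s timeout=40s
```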

Resource Types and Classes

Within a high availability cluster, resources are grouped into classes. Common classes are OCF, LSB, systemd, and sometimes special internal types. Within each class you select a resource type, and OCF resources additionally specify a provider.

For OCF resources, the type is often given as ocf:provider:resource. For example, ocf:heartbeat:IPaddr2 for a floating IP address or ocf:heartbeat:Filesystem for a shared filesystem. LSB and systemd classes map to init scripts and systemd units respectively.
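Assuming a Pacemaker cluster managed with `pcs`, the class, provider, and type appear directly in resource definitions. The names and addresses below are illustrative.

```shell
# An OCF resource: a floating IP from the heartbeat provider.
pcs resource create vip ocf:heartbeat:IPaddr2 \
    ip=192.0.2.10 cidr_netmask=24 op monitor interval=10s

# A systemd-class resource: the cluster controls an existing unit file.
pcs resource create web systemd:nginx op monitor interval=30s
```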

There are two broad categories of resources that matter for management decisions. Primitive resources represent individual service elements such as a single IP, a single filesystem, or a single daemon. Composite resources are built from primitives and include groups, clones, and master or slave resources for replicated or stateful services.

Understanding which type of resource matches your application affects how the resource manager is allowed to place and move it. Stateless services can often be cloned easily, while stateful services with a notion of primary and secondary require master or slave semantics and more careful handling.

Resource Placement and the Cluster Information Base

The cluster resource manager maintains an in memory representation of desired and actual state for all nodes and resources, often called the Cluster Information Base, or CIB. This includes resource definitions, node information, and all configured constraints.

Periodically, or when something changes, a built in policy engine computes a cluster wide plan. This plan decides which resources should be running on which nodes, which failures must be recovered, and whether any migrations or failovers are necessary. The calculation uses the resource configurations, node attributes, and constraints, and it assigns scores to candidate nodes for each resource.

The core idea is that each resource has a target role and a set of placement preferences. A node may be eligible or ineligible to run the resource. Eligible nodes are scored, and the one with the highest score is chosen. If conditions change, such as a node going offline or recovering from a failure, the computation runs again and a new plan is derived.

Cluster placement decisions are score based. A resource is placed on the eligible node with the highest score. If all scores are negative, the resource is not started anywhere.

This score based system is flexible. You can express preferences by adding or subtracting scores through constraints or dynamic rules. You can also let the resource manager avoid nodes with recent failures by applying additional negative scores or even permanent bans.

Failover, Fencing, and Recovery

Resource management in a high availability cluster is not only about normal placement. It is also about what happens when something fails. When a monitor action detects a failure, the resource manager first determines whether the failure is transient or persistent. It considers how many failures have occurred, on which node, and within what time window.

If the failure count exceeds defined thresholds, or if a failure is considered fatal, the manager stops trying to restart the resource on that node. It then attempts to move the resource to another node that is healthy and capable of running it. This is often called failover.

However, in clustered systems that protect data, just stopping or restarting a resource is often not sufficient. If a node appears unreachable or misbehaving, it may still have access to shared storage or may still think it is the active instance of a service. To ensure that no two nodes concurrently modify the same data, the cluster may perform fencing, which forcibly cuts the node off from shared resources by power cycling it or disabling its network or storage paths.

The resource manager only proceeds with failover of certain resources after fencing confirmation. This prevents split brain scenarios where two nodes believe they are both primary for a stateful resource and simultaneously modify the same data.

In many setups, you can configure how many restart attempts should occur locally before failover is triggered and how long past failures count against a node. This affects how aggressively the manager removes resources from an unstable node.
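With Pacemaker, these thresholds are typically expressed as resource meta attributes. A sketch with `pcs`, where the resource name `db` is a placeholder:

```shell
# Move the resource away after 3 failures on a node, and let those
# failures expire after 10 minutes so the node becomes eligible again.
pcs resource meta db migration-threshold=3 failure-timeout=10min

# Manually clear recorded failures once the underlying problem is fixed.
pcs resource cleanup db
```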

Resource Dependencies and Ordering

Individual resources rarely exist in isolation. For an application to run correctly, certain services must start in a particular order and stay together. For example, a database might require that the filesystem with its data is mounted and that a virtual IP is assigned to the same node.

The cluster resource manager models these relationships using constraints. Two of the fundamental constraint types are ordering and colocation. An ordering constraint describes which resource must start before another and which one should stop first when tearing down. A typical example is that a filesystem must start before the database and must stop after the database.

Colocation describes where resources may or must run in relation to each other. When you say that one resource is colocated with another, you instruct the cluster that if the second resource is running, the first must run on the same node. You can express colocation as mandatory or as a soft preference using scores.

Together, ordering and colocation constraints allow you to define multi tier stacks. For a simple two tier service, you might enforce that the IP, filesystem, and daemon all start in order on the same node and fail over together as a unit.

Every ordering relationship that represents a functional dependency should usually be paired with a corresponding colocation rule. Otherwise, a resource might start after its dependency but on a different node, which breaks the application.

By carefully defining constraints, you avoid starting an application before its data is available or assigning a service IP to a node that is not yet running the service behind it. In more complex clusters, you can express elaborate dependency trees and even symmetric or asymmetric relationships where some dependencies are not required in all directions.
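In `pcs` syntax, the filesystem-before-database example might look like this, with the resource names carried over from the earlier sketches:

```shell
# Start the filesystem before the database; the stop order is
# automatically reversed when tearing down.
pcs constraint order start shared_fs then start db

# Mandatory colocation: db may only run where shared_fs is running.
pcs constraint colocation add db with shared_fs INFINITY
```

A finite positive score instead of INFINITY would turn the colocation into a soft preference.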

Resource Groups and Service Stacks

When several resources always belong together, such as an IP, a filesystem, and a daemon that must always be on the same node and follow the same start and stop order, managing them individually with many explicit constraints becomes inconvenient and error prone. To simplify, resource managers often provide groups.

A resource group is an ordered list of resources that acts like a single logical unit. Resources in a group start in the order they appear and stop in reverse order. They are implicitly colocated, so they all run on the same node. Moving or disabling the group affects all members together.

Groups are particularly useful for simple service stacks where one node at a time hosts the entire stack and where cloning or multi primary behavior is not required. With a group, the cluster can still perform failover of the entire stack, but configuration remains concise.
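A sketch of such a group with `pcs`, assuming the `vip`, `shared_fs`, and `web` resources already exist:

```shell
# Members start in listed order (vip, then shared_fs, then web),
# stop in reverse order, and are implicitly colocated on one node.
pcs resource group add webstack vip shared_fs web

# The group can then be moved or disabled as a single unit.
pcs resource move webstack node2
```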

Internally, the resource manager still tracks each primitive in the group, monitors their health, and can detect which part has failed. Recovery decisions, however, may operate on the group level. If the daemon fails repeatedly, the entire group may be moved to another node so that both the IP and the filesystem follow.

Groups are not suitable for every resource. For example, a master or slave database setup requires a different composite structure, and stateless services that require instance level control across multiple nodes are usually managed with clones instead.

Clones and Multi Instance Resources

Some resources are intended to run on every node or on several nodes simultaneously. Examples include a cluster wide messaging service, a file synchronization agent, or a load balancing proxy that benefits from multiple instances for performance. For such cases, the resource manager provides clones.

A clone creates multiple instances of a primitive resource across nodes. Cloned resources are monitored and managed as individual instances but share a single definition. The manager decides how many instances to run and on which nodes. You can configure properties such as whether instances must run on all nodes or only a subset, and whether they are unique or anonymous.

Anonymous clones make every instance equivalent. The cluster does not care which specific instance is which, and any node that is eligible can host a clone instance. Unique clones assign a distinct identity to each instance, which matters when the resource must maintain per instance state or configuration.

Multi instance management introduces additional decisions. The resource manager must consider not just whether the resource is running somewhere, but also whether the desired number of instances are running and whether any node is overloaded. For example, if you want three instances across a five node cluster, the manager balances them according to scores and constraints.

Clones can also be combined with groups or with more advanced semantics, but at that point it becomes important to clearly understand which parts of a stack may safely be multi primary and which must remain single primary per cluster or per data set.
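Cloning an existing primitive with `pcs` might look like this; the `proxy` resource and the instance counts are illustrative:

```shell
# Run up to three anonymous instances, at most one per node.
pcs resource clone proxy clone-max=3 clone-node-max=1
```

Setting `globally-unique=true` instead would give each instance its own identity, for the per-instance state case described above.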

Master and Slave Resources and Roles

Some distributed systems, such as replication based databases or block replication tools, have the concept of different roles per instance. Typically, one instance acts as a primary or master and others act as secondaries or slaves. Only the master may accept writes, and failover involves promoting a previously secondary instance.

To model this, resource managers provide master or slave constructs. These are specialized clones where each instance can be in one of two roles, usually Master or Slave. The cluster knows how many master instances are allowed or required and how many slave instances may run concurrently.

The resource agent for a master or slave resource supports actions such as promote and demote in addition to start and stop. The resource manager calls these actions when it decides that an instance should change its role. It also monitors the role status and can change placement when failures occur.

Role management introduces further constraints. For example, you may require that clients connect only to the node that currently hosts the master. In that case, a virtual IP or load balancer resource must be ordered and colocated with the master role, not merely with the base resource. If the master moves, the IP or other access point must move with it to ensure data consistency.

In a master or slave setup, at most one master may be allowed to run at a time for a given data set. Multiple masters can cause data corruption unless the application explicitly supports multi primary operation.

By managing roles centrally, the resource manager can automate complex failover scenarios such as promoting a secondary, demoting the previous primary, and shifting external access points, all within a single consistent plan.
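In older `pcs` releases this is configured as a master/slave resource; newer releases call the same construct a promotable clone. A sketch using the older syntax, with all resource names assumed:

```shell
# Wrap the primitive 'db' in a master/slave resource: one master,
# two instances total across the cluster.
pcs resource master db-master db master-max=1 clone-max=2

# Tie the client-facing IP to wherever the master role runs, and only
# bring it up after promotion has completed.
pcs constraint colocation add vip with master db-master INFINITY
pcs constraint order promote db-master then start vip
```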

Node Attributes and Resource Stickiness

The cluster resource manager bases many decisions on node attributes. These attributes can describe hardware capabilities, such as available memory or presence of SSDs, or logical roles, such as being a preferred primary node or a backup node in a specific site. Attributes can be static, defined in configuration, or dynamic, reported by local agents.

One of the important score contributions that influences placement is stickiness. Stickiness is an additional score that favors keeping a resource where it is currently running instead of moving it after a transient failure or a small change in the cluster. Without stickiness, resources might move too often, which can cause unnecessary downtime and stress on the system.

By assigning a positive stickiness value, you give the currently hosting node a bonus in the placement calculation. This means that minor differences in other scores may not be enough to cause a migration. Conversely, a negative or low stickiness encourages the cluster to move resources more freely when a better node becomes available.

Node specific attributes can also be used in location constraints. For example, you might set an attribute to indicate that one node has access to a specific storage array and use that attribute to allow or disallow resource placement. This helps the manager avoid trying to start a service on nodes that lack critical dependencies.
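A hedged sketch of both ideas with `pcs`; the attribute name `has_san` and the stickiness value are assumptions:

```shell
# Give the currently hosting node a +100 bonus for all resources.
pcs resource defaults resource-stickiness=100

# Record that node1 can reach the storage array, then forbid the
# filesystem on any node where that attribute is not defined.
pcs node attribute node1 has_san=true
pcs constraint location shared_fs rule score=-INFINITY not_defined has_san
```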

Scores, Location Constraints, and Policies

At the heart of resource placement is the score system. Each resource, for each node, has an overall score that is the sum of many contributions. These contributions come from default policies, from resource stickiness, from observed failures, and from location constraints you configure.

A location constraint ties a resource to a node with a particular score. A very high positive score means that the resource strongly prefers that node. A very low negative score effectively bans the resource from that node. Intermediate values express softer preferences.

The cluster may also apply automatic negative scores when a resource has failed on a node within a certain interval. If failures keep happening, the score may drop below zero, and the manager will choose another node for that resource. You can configure how long these negative scores persist and when they should be cleared.

By combining multiple constraints, you can implement policies such as active or passive roles, site affinity in multi site clusters, or resource segregation where unrelated services avoid running on the same node. The result is a flexible but predictable placement behavior that matches your operational needs.

The effective score for a resource on a node is the sum of all applicable scores. If the total is negative, the resource will not run on that node. If the total is $-\infty$, the node is hard banned for that resource.

Although the resource manager calculates scores automatically, understanding their effect helps you design constraint sets that are clear and maintainable. Explicit, high magnitude scores are often easier to reason about than many small relative values.
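As a sketch of such explicit scores with `pcs` (node and resource names are placeholders), and of how the computed scores can be inspected:

```shell
# Soft preference: +200 for node1; hard ban (-INFINITY) from node3.
pcs constraint location web prefers node1=200
pcs constraint location web avoids node3

# Inspect the allocation scores the policy engine computed
# against the live cluster state.
crm_simulate -sL
```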

Resource Monitoring and Timeouts

To ensure that resources stay healthy after they are started, the cluster runs monitor actions at configured intervals. Each monitor action has an interval and a timeout. The interval defines how often the check runs, such as every 10 seconds or every minute, and the timeout defines how long the cluster waits for a response before considering the action failed.

Timeouts are critical. If they are too short, normal operations may sometimes exceed them and appear as failures. If they are too long, the cluster reacts slowly to real problems, increasing downtime. Choosing appropriate values requires understanding the behavior and startup times of your applications.

The cluster distinguishes between failure of a monitor action and failure of the resource itself. A failed monitor may mean that the resource is not responding as expected or that the node is overloaded or unreachable. The resource manager correlates monitor results with other information, such as node status, to decide whether to restart, move, or fence.

Multiple monitors can exist for a resource. For example, a quick shallow check can run frequently, while a more thorough deep check runs less often with a longer timeout. This allows you to balance responsiveness with accuracy and system load.
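A sketch of adding a second, deeper monitor with `pcs`, reusing the `shared_fs` resource from earlier (whose shallow monitor was defined at creation time):

```shell
# Add a deeper, less frequent check with its own timeout. The intervals
# of multiple monitors on one resource must differ. OCF_CHECK_LEVEL asks
# supporting agents, such as ocf:heartbeat:Filesystem, for a more
# thorough probe than the default status check.
pcs resource op add shared_fs monitor interval=5min timeout=60s OCF_CHECK_LEVEL=10
```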

Maintenance Mode and Planned Changes

Not all changes in a cluster are the result of failures. Administrators often need to perform maintenance, such as patching kernels, upgrading firmware, or changing storage. To avoid unwanted automated actions during such work, resource managers provide maintenance mode.

When the entire cluster is set to maintenance mode, it suspends automated recovery and placement decisions for resources; depending on the implementation, recurring monitor actions may be paused as well. You can then safely restart nodes, restart services manually, and make configuration changes without the cluster trying to compensate.

For more targeted work, you can mark a specific node or resource as unmanaged. Unmanaged resources are not started or stopped by the cluster until you re enable them. This is useful when you need to test a configuration or perform manual diagnostics without interference.

Maintenance features are part of responsible resource management. They make it possible to evolve and repair the cluster without unexpected failovers or data risks that could arise from simultaneous human and automated interventions.
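With `pcs`, these maintenance features map to a few commands; the resource name `db` is a placeholder:

```shell
# Suspend automated decisions for the whole cluster...
pcs property set maintenance-mode=true
# ...and re-enable them when the work is done.
pcs property set maintenance-mode=false

# More targeted: stop managing a single resource, then resume.
pcs resource unmanage db
pcs resource manage db
```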

Designing Resource Management for High Availability

Effective cluster resource management is as much about design as it is about individual settings. High availability does not come automatically when you throw services into a cluster. You must think through dependencies, failover strategies, and recovery behavior.

Begin by modeling each service as a set of resources, such as IP, storage, and daemons. Decide whether the service is truly stateful and requires single primary semantics or whether it can be scaled through cloning. Define ordering and colocation relationships to express real world dependencies accurately.

Next, choose sensible defaults for stickiness and failure policies. Decide when you want automatic failover versus manual intervention. Configure timeouts, monitor intervals, and maximum restart attempts based on real application behavior. Always test failover scenarios in a non production environment before relying on them.

Finally, use constraints and node attributes to make placement predictable. Avoid overlapping or contradictory policies that are hard to reason about. Clear, simple rules lead to a cluster that behaves consistently when under stress, which is when resource management truly matters.
