
5.6.4 Cluster resource management

Understanding Cluster Resources

In a high-availability (HA) cluster, resources are the things the cluster starts, stops, moves, and monitors to provide services. Typical examples:

  • virtual (floating) IP addresses
  • filesystems and mounts
  • replicated block devices such as DRBD
  • databases, web servers, and other service daemons

Cluster resource management is about:

  • deciding where each resource runs
  • starting and stopping resources in the correct order
  • monitoring resource health
  • recovering from failures according to policy
  • expressing relationships between resources (together, apart, ordered)

The concrete tools and syntax differ (for example, Pacemaker on top of Corosync, managed with pcs or crmsh), but the concepts below are common across modern Linux HA stacks.

Resource Types and Agents

Most HA stacks use resource agents to manage resources in a standardized way. A resource agent knows how to:

  • start the resource
  • stop it cleanly
  • monitor whether it is healthy
  • validate its configuration and report metadata about its parameters

Common agent families:

  • OCF (Open Cluster Framework) agents, the most featureful
  • systemd units
  • LSB init scripts
  • fencing (STONITH) agents for node-level recovery

You typically create a resource by choosing a type/agent and providing parameters. Conceptually:

# Pseudo-example, not tied to a specific tool
create resource vip type=IPaddr2 params ip=10.0.0.100 cidr_netmask=24
create resource web type=apache params configfile=/etc/httpd/conf/httpd.conf
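
In a real deployment the same idea is written with a concrete tool. Assuming a Pacemaker cluster managed with pcs (resource names as above, agents from the standard ocf:heartbeat collection; exact syntax varies across pcs versions):

```shell
# Hedged pcs equivalents of the pseudo-commands above.
pcs resource create vip ocf:heartbeat:IPaddr2 \
    ip=10.0.0.100 cidr_netmask=24 op monitor interval=30s
pcs resource create web ocf:heartbeat:apache \
    configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s
```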

The agent abstracts the complexity. The cluster uses the same start/stop/monitor interface for all resources regardless of their internal details.
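
That uniform interface can be pictured as a single dispatch function. A minimal toy sketch (not a real OCF agent: real agents also emit XML metadata and follow the OCF exit-code conventions):

```shell
# Toy resource-agent skeleton: one entry point, dispatching on the
# requested action. Everything here is illustrative; a real agent
# would actually start, stop, and probe the managed service.
my_agent() {
    case "$1" in
        start)   echo "starting service" ;;
        stop)    echo "stopping service" ;;
        monitor) echo "running" ;;
        *)       echo "unimplemented action: $1" >&2; return 3 ;;
    esac
}

my_agent monitor   # prints "running"
```

Because every agent answers the same set of actions, the cluster can manage an IP address and a database with identical machinery.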

Resource States and Lifecycle

Across cluster solutions, resources usually have a small set of states:

  • Stopped: not running anywhere
  • Started: running on one node (or several, for clones)
  • Failed: the last action or monitor reported an error
  • Unmanaged: present in the configuration but ignored by the cluster
  • For multi-state resources, additional roles such as Master/Slave

The cluster resource manager (CRM):

  1. Decides target state (e.g., started on some node).
  2. Issues actions: start, stop, monitor, promote, demote, migrate.
  3. Watches results and updates its internal view.
  4. Reacts to failures according to configured policies.
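
Steps 1 to 3 amount to a reconciliation loop: compare the desired state with the observed state and derive actions. A simplified sketch (names invented; Pacemaker's real scheduler is far more involved):

```python
# Toy CRM-style reconciliation: compute the actions needed to
# converge the observed state toward the desired state.

def reconcile(desired, observed):
    """desired/observed map resource name -> state ("Started"/"Stopped").
    Returns a list of (action, resource) pairs."""
    actions = []
    for resource, target in desired.items():
        current = observed.get(resource, "Stopped")
        if target == "Started" and current != "Started":
            actions.append(("start", resource))
        elif target == "Stopped" and current != "Stopped":
            actions.append(("stop", resource))
    return actions

desired = {"vip": "Started", "web": "Started", "old_db": "Stopped"}
observed = {"vip": "Started", "web": "Failed", "old_db": "Started"}
print(reconcile(desired, observed))
# [('start', 'web'), ('stop', 'old_db')]
```

The real manager additionally orders these actions according to constraints and reacts to the result of each one.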

Resource Constraints

Constraints tell the cluster how resources should be arranged. They are the core of resource management.

Most systems have three fundamental constraint types: location, colocation, and order constraints, described in turn below.

Location Constraints

Answer: On which node(s) may this resource run?

Examples:

  • web should preferably run on node1.
  • db must never run on node3.

Conceptually:

location prefer_web_on_node1 web prefer node1
location avoid_db_on_node3  db  ban    node3
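
With pcs on a Pacemaker cluster, for example, the same two rules could be expressed as (node and resource names are illustrative):

```shell
# Preference with a finite score; the cluster may still override it.
pcs constraint location web prefers node1=100
# Hard ban (score -INFINITY).
pcs constraint location db avoids node3
```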

Location constraints can be:

  • preferences (finite scores) that the cluster may override
  • mandatory rules (scores of +INFINITY or -INFINITY) that it may not
  • static, or rule-based (e.g., derived from node attributes)

Colocation Constraints

Answer: Which resources should run together (or apart)?

Examples:

  • web must run on the same node as its virtual IP.
  • The filesystem must be mounted where DRBD is primary.
  • db and backup should never share a node.

Conceptually:

colocate web_with_vip web with vip  # Same node
colocate fs_with_drbd filesystem with drbd_primary
colocate never_db_with_backup db with backup score=-INFINITY
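
As a concrete (hedged) counterpart, with pcs the same relationships might read:

```shell
# Mandatory same-node placement (score INFINITY).
pcs constraint colocation add web with vip INFINITY
pcs constraint colocation add filesystem with drbd_primary INFINITY
# Mandatory anti-colocation.
pcs constraint colocation add db with backup -INFINITY
```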

Colocation is often used to express resource stacking:

  1. Storage (DRBD, LVM, etc.)
  2. Filesystem mount
  3. Database
  4. Application

Order Constraints

Answer: In what order should resources start or stop?

Examples:

  • Promote DRBD before mounting the filesystem.
  • Mount the filesystem before starting the database.
  • Start the database before the web application.
  • Stop everything in the reverse order.

Conceptually:

order start_stack: drbd_primary then filesystem then db then web
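
In pcs syntax, for instance, such a chain would be built from pairwise rules (names illustrative):

```shell
pcs constraint order promote drbd_primary then start filesystem
pcs constraint order start filesystem then start db
pcs constraint order start db then start web
```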

Order and colocation often go together:

  • Colocation says "on the same node"; order says "in this sequence."
  • Colocation alone may start resources on the same node but in the wrong order.
  • Order alone may start them in sequence but on different nodes.

Resource Groups

A resource group is a simple way to say “these resources go together as a unit.”

Characteristics:

  • Members are implicitly colocated: all run on the same node.
  • Members start in the listed order and stop in reverse order.
  • If a member cannot start, the members after it are not started.

Example conceptual group:

group web_stack \
    vip \
    filesystem \
    web

Instead of creating separate constraints for each pair, you define a group and apply constraints to the group as a whole.
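
For example, with pcs a group equivalent to the conceptual web_stack above could be created in one command (names illustrative):

```shell
# Implies: same node, start vip -> filesystem -> web, stop in reverse.
pcs resource group add web_stack vip filesystem web
```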

When to use groups:

  • simple linear stacks, e.g., VIP + filesystem + service
  • resources that always live and move together as one unit

When not to use groups:

  • branching or partial dependencies (A needs B, but C needs only part of the stack)
  • members that may legitimately run on different nodes
  • relationships involving multi-state resources; use explicit constraints instead

Multi-State Resources (Master/Slave)

Some resources can exist in multiple roles, typically:

  • a promoted role (often called Master or Primary)
  • an unpromoted role (Slave, Secondary, or replica)

Typical examples:

  • DRBD (one Primary, one Secondary per device)
  • database replication (one writable primary, read-only replicas)

Key points:

  • The cluster limits how many instances may be promoted (e.g., master-max=1).
  • Constraints can target a specific role (e.g., drbd_resource:Master).
  • promote and demote become additional actions in the resource lifecycle.

Conceptual example:

ms drbd_resource drbd_agent meta master-max=1 clone-max=2
# Filesystem colocated with *Master* instance
colocate fs_with_drbd_master filesystem with drbd_resource:Master
order start_drbd_before_fs: drbd_resource:promote then filesystem:start
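
In recent Pacemaker versions the same concept is configured as a "promotable clone". With pcs it might look roughly like this (syntax differs noticeably between pcs versions, so treat this as a sketch):

```shell
# Turn an existing drbd_resource into a promotable clone with at most
# one promoted instance; the clone id defaults to drbd_resource-clone.
pcs resource promotable drbd_resource promoted-max=1
pcs constraint order promote drbd_resource-clone then start filesystem
```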

Resource Stickiness and Scoring

The scheduler decides where to place resources based on a scoring system:

  • Every resource/node combination receives a numeric score from constraints and defaults.
  • The resource is placed on the highest-scoring eligible node.
  • A score of -INFINITY means "never run here"; +INFINITY means "must run here."

Important concept: stickiness — how strongly a resource prefers to stay where it is.

Typical use:

  • Prevent automatic failback: when a failed node returns, the service stays where it is instead of moving again.
  • Avoid churn when scores between nodes are nearly equal.

Conceptually:

# Make all resources prefer to stay where they are (cluster-wide default)
set_property resource-stickiness=100
# Or per resource
set_meta web resource-stickiness=200

Additionally, you may use:

  • node attributes and rule-based scores
  • utilization attributes (capacity-based placement)

to influence placement.
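
The interaction between location scores and stickiness can be illustrated with a toy calculation (an invented helper, not cluster code): each candidate node's total is its location score, plus stickiness on the node currently hosting the resource, and the highest total wins.

```python
# Toy placement scoring in the spirit of Pacemaker's scheduler.

def place(resource, nodes, location_scores, stickiness, current_node=None):
    """Return (winning_node, per_node_totals)."""
    totals = {}
    for node in nodes:
        score = location_scores.get((resource, node), 0)
        if node == current_node:
            score += stickiness      # bonus for staying put
        totals[node] = score
    return max(totals, key=totals.get), totals

# web prefers node1 (+100) but currently runs on node2 with
# stickiness 200, so it stays on node2 rather than failing back:
best, totals = place("web", ["node1", "node2"],
                     {("web", "node1"): 100},
                     stickiness=200, current_node="node2")
print(best, totals)   # node2 {'node1': 100, 'node2': 200}
```

With stickiness below 100 the same call would move web back to node1; that trade-off is exactly what stickiness tuning controls.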

Resource Monitoring and Timeouts

Each resource should be monitored periodically:

  • The cluster runs the agent's monitor action at a configured interval.
  • Each operation has a timeout; exceeding it counts as a failure.
  • An on-fail policy decides the reaction (restart, migrate, fence, ...).

Good practice:

  • Define explicit monitor operations rather than relying on defaults.
  • Choose intervals that match how quickly failures must be detected.
  • Set realistic timeouts: too short causes false failures, too long delays detection.
  • Use different monitor settings per role for multi-state resources.

Conceptual operation definitions:

op monitor interval=30s timeout=10s on-fail=restart
op monitor role=Master interval=10s timeout=20s on-fail=demote
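
As one concrete (hedged) example, with pcs a monitor operation can be attached to an existing resource like so:

```shell
pcs resource op add web monitor interval=30s timeout=10s on-fail=restart
```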

Monitoring policies affect:

  • how quickly failures are detected
  • how much load health checks place on nodes and services
  • what happens on failure (restart in place, migrate, fence)

Failure Handling and Recovery Policies

When a resource fails, the cluster follows policies such as:

  • restart the resource in place
  • migrate it to another node after repeated failures
  • fence the node if the failure suggests deeper trouble
  • leave the resource stopped and wait for an administrator

Typical per-resource failure settings (conceptually):

  • migration-threshold: how many failures on a node before moving away
  • failure-timeout: how long until recorded failures expire
  • on-fail: what to do when a specific operation fails

Example:

set_meta db migration-threshold=3 failure-timeout=60s

Interpretation:

  • After 3 failures of db on a node, the cluster moves db away from that node.
  • 60 seconds after the last failure, the fail count expires and the node becomes eligible again.

Also common:

  • on-fail=fence or on-fail=standby for operations whose failure indicates node-level trouble
  • administrative cleanup commands to reset fail counts after fixing the root cause

Careful tuning is important:

  • Thresholds that are too low cause needless migrations on transient errors.
  • Thresholds that are too high delay recovery.
  • A short failure-timeout can let a flapping resource bounce between nodes indefinitely.

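The migration-threshold and failure-timeout behavior above can be sketched as simple fail-count accounting. This is a simplified model (real Pacemaker tracks a per-node fail count with its own expiry semantics), with invented names:

```python
# Toy fail-count tracking: a resource is banned from a node once its
# recent failures reach migration-threshold; failures older than
# failure-timeout seconds no longer count.
import time

class FailCounter:
    def __init__(self, migration_threshold=3, failure_timeout=60):
        self.threshold = migration_threshold
        self.timeout = failure_timeout
        self.failures = {}                 # node -> failure timestamps

    def record_failure(self, node, now=None):
        now = time.time() if now is None else now
        self.failures.setdefault(node, []).append(now)

    def banned(self, node, now=None):
        now = time.time() if now is None else now
        recent = [t for t in self.failures.get(node, [])
                  if now - t < self.timeout]
        self.failures[node] = recent       # drop expired failures
        return len(recent) >= self.threshold

fc = FailCounter(migration_threshold=3, failure_timeout=60)
for t in (0, 10, 20):
    fc.record_failure("node1", now=t)
print(fc.banned("node1", now=30))    # True: 3 failures within 60s
print(fc.banned("node1", now=120))   # False: the failures have expired
```
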
Clones and Distributed Resources

Some resources are intended to run on all or several nodes:

  • cluster infrastructure daemons (DLM, cluster filesystems such as GFS2 or OCFS2)
  • health or monitoring daemons
  • load-balanced service instances

A clone resource represents N identical instances of the same resource.

Properties:

  • clone-max: total number of instances across the cluster
  • clone-node-max: maximum instances per node
  • clone-min: minimum instances that must run (where supported)
  • anonymous vs. globally unique clones, depending on whether instances are interchangeable

Conceptually:

clone dlm_clone dlm_agent meta clone-max=4 clone-min=2
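
With pcs, for instance, cloning an existing resource might look like this (hedged; option handling differs between pcs versions):

```shell
pcs resource clone dlm_agent clone-max=4 clone-node-max=1
```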

For services that must be available on multiple nodes simultaneously, clones work in tandem with:

  • location constraints restricting which nodes run instances
  • ordering constraints for resources that depend on the cloned service
  • the interleave option, so dependents wait only for their local instance

Maintenance Mode and Manual Control

You often need to temporarily override automatic behavior:

  • during software upgrades and hardware maintenance
  • while debugging a misbehaving resource
  • when you want to move a resource by hand

Key mechanisms:

  • maintenance mode: the cluster stops starting, stopping, and monitoring resources cluster-wide
  • unmanaging a single resource (is-managed=false)
  • target-role: setting a resource's desired state to Stopped or Started
  • manual move/migrate commands, which typically work by adding location constraints
  • putting a node into standby so it cannot host resources

Conceptually:

# Stop the cluster from automatically managing resources
set_property maintenance-mode=true
# Stop a single resource
set_meta web target-role=Stopped
# Force-move a resource to another node
migrate web node2
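
The pcs equivalents of these conceptual commands would be roughly:

```shell
pcs property set maintenance-mode=true   # stop automatic management
pcs resource disable web                 # target-role=Stopped for one resource
pcs resource move web node2              # adds a location constraint under the hood
```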

Always use the cluster’s control commands rather than starting or stopping services directly with systemctl (unless your stack explicitly supports that). Otherwise the cluster’s view of resource state diverges from reality, and it may “correct” your manual change or report false failures.

Designing a Resource Management Strategy

When setting up cluster resource management in practice, common steps are:

  1. Identify all required resources:
    • IPs, storage, filesystems, applications, supporting daemons.
  2. Define resource agents and parameters:
    • Choose the appropriate agent type (OCF, systemd, etc.).
    • Configure paths, ports, config files.
  3. Model relationships with constraints:
    • Use colocation + order for stacks.
    • Use groups for simple linear stacks.
    • Use multi-state resources where needed (replication).
  4. Plan placement rules:
    • Location constraints for node preferences.
    • Stickiness and scores to reduce churn.
  5. Configure monitoring and timeouts:
    • Different monitor intervals per role if necessary.
    • Ensure timeouts are realistic.
  6. Define failure and recovery policies:
    • migration thresholds
    • failure timeouts
    • fencing behavior (handled in cluster-wide settings and fencing configuration).
  7. Test scenarios:
    • Node failure.
    • Resource failure.
    • Manual migrations.
    • Restart of cluster software itself.

The result is a cluster that behaves predictably under normal operations and failures, with resources placed, started, and recovered according to your explicit policies rather than ad-hoc behavior.
