
5.6.3 Corosync

Role of Corosync in a Cluster Stack

Corosync is the messaging and membership layer commonly used in Linux HA clusters, especially together with Pacemaker. In the overall stack:

  • Corosync provides cluster membership (which nodes are alive), reliable ordered messaging between nodes, and quorum information.
  • Pacemaker sits on top as the resource manager, deciding where services run based on the membership and quorum data Corosync supplies.

You normally do not run Pacemaker without Corosync in a classic “Corosync + Pacemaker” stack; they are tightly integrated but are separate projects and daemons.

Corosync Architecture Overview

Key architectural concepts specific to Corosync:

  • The Totem protocol: a token-passing protocol providing reliable, ordered message delivery and fast failure detection.
  • Closed Process Groups (CPG): the messaging API that cluster-aware applications use to exchange messages.
  • votequorum: the service that tracks votes and decides whether the cluster is quorate.
  • cmap: an in-memory key-value database holding configuration and runtime state, inspectable with corosync-cmapctl.

In practice you will mostly interact with the configuration file, daemons, and corosync-* tools; the internal APIs are used by cluster-aware applications (like Pacemaker) rather than directly by admins.

Installing and Enabling Corosync

Installation is distribution-specific but follows the same pattern.

Examples:

  # RHEL / Fedora / CentOS Stream
  sudo dnf install corosync corosync-qdevice
  sudo systemctl enable --now corosync

  # Debian / Ubuntu
  sudo apt install corosync corosync-qdevice
  sudo systemctl enable --now corosync

  # openSUSE / SLES
  sudo zypper install corosync corosync-qdevice
  sudo systemctl enable --now corosync

On many “HA cluster” stacks, additional packages (such as Pacemaker) are installed at the same time; this chapter focuses on the Corosync side only.
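
After installation you can confirm which major version you are running, which matters because defaults (notably the transport) differ between Corosync 2.x and 3.x:

  corosync -v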

Corosync Configuration File (`corosync.conf`)

The main configuration file is usually:

/etc/corosync/corosync.conf

It is cluster-wide: every node in the same cluster must have an identical corosync.conf. Even node-specific values, such as each node's ringX_addr, appear the same way in every copy, because the nodelist describes all nodes rather than just the local one.

A minimal configuration using the udpu (unicast UDP) transport might look like this (Corosync 3.x defaults to the newer knet transport, but udpu remains common and simple):

totem {
    version: 2
    secauth: on
    cluster_name: mycluster
    transport: udpu
}
nodelist {
    node {
        ring0_addr: node1.example.com
        nodeid: 1
    }
    node {
        ring0_addr: node2.example.com
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
}
logging {
    to_syslog: yes
    to_stderr: no
    logfile: /var/log/corosync/corosync.log
    timestamp: on
}

You will usually copy and adapt such a file across all nodes.

`totem` Section

The totem block configures the messaging protocol and cluster basics. Common parameters:

  • version: protocol version; must be 2.
  • cluster_name: a name shared by all nodes of the cluster.
  • secauth: enables authentication (and with it, encryption) of cluster traffic using the shared authkey.
  • transport: the network transport, e.g. udpu (unicast UDP), udp (multicast), or knet on Corosync 3.x.
  • token: token timeout in milliseconds, i.e. how long Corosync waits before declaring a token lost.
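
As a sketch (the values are illustrative, not recommendations), two timing knobs often tuned on lossy networks are token and token_retransmits_before_loss_const:

totem {
    version: 2
    cluster_name: mycluster
    transport: udpu
    token: 3000                               # wait 3000 ms before declaring the token lost
    token_retransmits_before_loss_const: 10   # retransmit attempts before giving up
}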

Multiple Rings in `totem`

To use more than one network path, you can configure multiple rings, e.g.:

totem {
    version: 2
    secauth: on
    cluster_name: mycluster
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.10.0
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.10.10.0
        mcastport: 5407
    }
}

Corosync can automatically fail over to another ring if one network path fails, improving cluster robustness.
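
Note that with the older udp/udpu transports (Corosync 2.x), redundant rings use the Redundant Ring Protocol (RRP) and additionally require an rrp_mode setting in totem; with the knet transport (Corosync 3.x), multiple links are handled natively and no rrp_mode is needed. A minimal sketch for the udpu case:

totem {
    # ... other totem settings as in the example above ...
    rrp_mode: passive   # "passive" uses one ring at a time; "active" sends on all rings simultaneously
}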

`nodelist` Section

The nodelist defines all nodes in the cluster:

nodelist {
    node {
        name: node1
        ring0_addr: 192.168.10.11
        ring1_addr: 10.10.10.11
        nodeid: 1
    }
    node {
        name: node2
        ring0_addr: 192.168.10.12
        ring1_addr: 10.10.10.12
        nodeid: 2
    }
}

Key elements:

  • name: a human-readable node name (often the hostname), used by higher layers such as Pacemaker.
  • ring0_addr / ring1_addr: the node's address on each ring; IP addresses are generally preferred over DNS names to avoid a resolver dependency.
  • nodeid: a unique integer identifying the node within the cluster.

Corosync does not use automatic discovery in typical Pacemaker setups; you list all nodes explicitly in nodelist.
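
To confirm which nodes Corosync actually loaded from the nodelist, you can filter the cmap database (the exact key layout may vary slightly between versions):

  corosync-cmapctl | grep '^nodelist'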

`quorum` Section

Quorum is the mechanism that ensures only a partition of nodes holding a majority of the votes can run cluster resources, preventing split-brain situations.

Quorum in Corosync is typically handled by the votequorum service:

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}

Key parameters:

  • provider: corosync_votequorum selects the standard vote-based quorum service.
  • expected_votes: the total number of votes the full cluster should have (normally the node count, one vote per node).
  • two_node: 1 enables the special two-node mode, in which a single surviving node can remain quorate (and which implicitly enables wait_for_all).

You will typically not implement detailed quorum policy here; reacting to quorum changes (for example, stopping resources when quorum is lost) is handled by Pacemaker, which uses the quorum information Corosync provides.
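
To see the votequorum values actually in effect at runtime, the cmap database can again be filtered (key names may differ slightly between versions):

  corosync-cmapctl | grep votequorum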

`logging` Section

Corosync logging is configured via the logging block:

logging {
    fileline: off
    to_syslog: yes
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    timestamp: on
    debug: off
}

Common parameters:

  • to_syslog / to_logfile / to_stderr: select the logging destinations.
  • logfile: the log file path used when to_logfile: yes (the directory must exist and be writable).
  • timestamp: prefixes each message with a timestamp.
  • fileline: adds source file and line information, mainly useful when debugging Corosync itself.
  • debug: enables verbose debug logging; leave off in normal operation.
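
When testing configuration changes, it is often convenient to follow the messages live:

  journalctl -u corosync -f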

Authentication and Cluster Keys

With secauth: on, Corosync uses a shared key for message authentication and encryption. This key is stored in:

/etc/corosync/authkey

The file must:

  • be owned by root,
  • be readable only by root (mode 0400 or 0600), and
  • be identical, byte for byte, on every node in the cluster.

To generate the key:

sudo corosync-keygen

This will:

  • gather random data (older versions read /dev/random and may pause waiting for entropy; newer versions use /dev/urandom), and
  • write the key to /etc/corosync/authkey with restrictive permissions.

Distribute the generated key securely to all nodes, e.g.:

  sudo scp /etc/corosync/authkey root@node2:/etc/corosync/authkey
  # then, on node2:
  sudo chown root:root /etc/corosync/authkey
  sudo chmod 600 /etc/corosync/authkey

Do not edit authkey manually; always regenerate when you need a new key.
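
A quick sanity check on each node (the exact mode may be 0400 or 0600 depending on how the key was created and copied):

  ls -l /etc/corosync/authkey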

Starting, Stopping, and Status

Corosync is managed through systemd on most distributions.

Common operations:

# Start Corosync
sudo systemctl start corosync
# Enable at boot
sudo systemctl enable corosync
# Check current status
systemctl status corosync
# Stop Corosync
sudo systemctl stop corosync
# Restart (after config changes)
sudo systemctl restart corosync

When running with Pacemaker, cluster management tools may expect both services to be running; always ensure that changes to Corosync are coordinated with the rest of the cluster.
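
For example, when Pacemaker runs on top of Corosync, stop the stack from the top down and start it from the bottom up (service names assume the standard pacemaker unit):

  # stop: resource manager first, then the membership layer
  sudo systemctl stop pacemaker
  sudo systemctl stop corosync
  # start: membership layer first, then the resource manager
  sudo systemctl start corosync
  sudo systemctl start pacemaker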

Inspecting Cluster Membership and Quorum

Corosync provides several CLI tools to inspect membership and quorum. These are particularly useful to verify that Corosync itself is healthy before investigating resource-manager issues.

`corosync-cmapctl`

Displays key-value pairs in Corosync’s configuration and runtime database (cmap / confdb):

# Show all keys and values
corosync-cmapctl
# Filter by category, e.g., runtime membership
corosync-cmapctl | grep runtime.members

Helpful keys:

  • runtime.members.*: current membership, including each node's address and status (on Corosync 2.x these live under runtime.totem.pg.mrp.srp.members.*).
  • nodelist.*: the nodes as loaded from corosync.conf.
  • totem.*: the effective totem settings.
  • runtime.votequorum.*: live quorum state.

Use this tool when you need low-level detail about what Corosync believes about the cluster state.

`corosync-quorumtool`

Shows quorum status and membership:

corosync-quorumtool

Typical output:

Quorum information
------------------
Date:             Fri Dec 12 10:22:34 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/12345
Quorate:          Yes
Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1  
Flags:            2node, Quorate
Membership information
----------------------
    Nodeid      Name
         1      node1 (local)
         2      node2

Key aspects:

  • Quorate: Yes is the most important field: this partition may run resources.
  • Expected votes and Total votes show how many votes the full cluster carries and how many are currently present.
  • Flags reports special modes such as 2node or WaitForAll.
  • The membership table lists node IDs and names and marks the local node.

You can also list just the membership:

corosync-quorumtool -l

This prints only the node table; similarly, corosync-quorumtool -s shows just the quorum status, which is convenient for quick checks or scripts.

Two-Node Clusters and `two_node` vs QDevice

Two-node clusters are common but tricky, because:

Corosync provides two main concepts for this:

  1. two_node mode in votequorum (simple / lab setups):
    • two_node: 1 and expected_votes: 2.
    • The remaining single node can still be quorate after its peer fails.
    • Does not protect you from network partitions where both think the other is gone.
    • Suitable mainly for simple or non-critical setups.
  2. QDevice (Quorum Device) (recommended for production two-node clusters):
    • A separate “arbitrator” node or service (qnetd) that provides an additional vote.
    • Implemented via the corosync-qdevice daemon and a qnetd server.
    • Typical scenario:
      • Node1, Node2, and a QDevice (e.g., small VM elsewhere).
      • Cluster uses 3 votes.
      • Any partition without at least 2 votes (e.g., one node alone) becomes non-quorate.

To use QDevice, you would:

  1. run the corosync-qnetd server on a third, independent host (the arbitrator),
  2. install and enable corosync-qdevice on each cluster node, and
  3. add a device block to the quorum section of corosync.conf, as sketched below.
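
A minimal sketch of the quorum section with a QDevice (the qnetd host name is a placeholder; ffsplit is the usual algorithm for two-node clusters):

quorum {
    provider: corosync_votequorum
    device {
        model: net
        net {
            host: qnetd.example.com   # placeholder: your arbitrator host
            algorithm: ffsplit        # resolves fifty-fifty splits in two-node clusters
        }
    }
}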

Deep QDevice configuration is beyond this chapter; the key point is that Corosync has built-in support to handle quorum in small and asymmetric clusters more safely than two_node alone.

Using Multiple Rings for Redundancy

Multiple rings let Corosync continue working even if one network path fails.

Typical configuration steps:

  1. Provide separate interfaces and IP ranges, e.g.:
    • ring0 on ens33 with 192.168.10.x.
    • ring1 on ens34 with 10.10.10.x.
  2. Configure nodelist with both addresses:
   nodelist {
       node {
           name: node1
           ring0_addr: 192.168.10.11
           ring1_addr: 10.10.10.11
           nodeid: 1
       }
       node {
           name: node2
           ring0_addr: 192.168.10.12
           ring1_addr: 10.10.10.12
           nodeid: 2
       }
   }
  3. Ensure any totem interface definitions match these networks.

Corosync will:

  • send heartbeat traffic across all configured rings,
  • mark a ring as faulty when it stops responding and keep running over the remaining ring(s), and
  • with the RRP-based udp/udpu transports, keep a ring marked faulty until it is re-enabled (e.g., with corosync-cfgtool -r); the knet transport re-enables recovered links automatically.

You can see ring-related status via corosync-cfgtool -s, or at a lower level via corosync-cmapctl (e.g., keys under runtime.totem.pg.mrp.srp on Corosync 2.x).
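
For example:

  # Show the status of each configured ring on this node
  corosync-cfgtool -s

  # Re-enable rings cluster-wide once the underlying fault is repaired (RRP transports)
  sudo corosync-cfgtool -r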

Common Operational Tasks

Below are tasks specifically tied to Corosync management and troubleshooting, not Pacemaker resources.

Rolling Corosync Configuration Changes

When updating corosync.conf:

  1. Edit the file on one node.
  2. Distribute the identical file to all other nodes:
   sudo scp /etc/corosync/corosync.conf node2:/etc/corosync/
  3. Apply the change. For settings that can change at runtime, ask all nodes to reload:
   sudo corosync-cfgtool -R

(-R tells every Corosync instance in the cluster to reload corosync.conf; not all settings are reloadable.)

For settings that cannot be reloaded, restart Corosync one node at a time, ensuring the cluster remains quorate and functional between restarts:
   sudo systemctl restart corosync
  4. Verify cluster membership after each restart:
   corosync-quorumtool

Some settings (like transport, ring definitions) require a full restart. Keep cluster impact in mind.
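
A hypothetical rolling-restart sketch (node names and the 10-second pause are placeholders to adapt; assumes SSH access and sudo rights on each node):

  for host in node1 node2; do
      ssh "$host" sudo systemctl restart corosync
      sleep 10                            # give the node time to rejoin
      ssh "$host" corosync-quorumtool -s  # confirm the cluster is quorate again
  done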

Checking Logs

Corosync logs are crucial when diagnosing membership problems:

    journalctl -u corosync
    sudo less /var/log/corosync/corosync.log

Common message patterns:

  • "A new membership (...) was formed": a membership change (a node joined or left).
  • "Token has not been received in ... ms": token timeouts, usually pointing at network loss or congestion.
  • "Totem is unable to form a cluster because of an operating system or network fault": Corosync cannot reach its peers at all, often due to firewall or routing problems.

Use timestamps and Ring IDs to correlate with observed resource failovers or Pacemaker events.

Typical Corosync Problems and How to Approach Them

A few frequent Corosync-specific issues and troubleshooting angles:

  1. Node not joining the cluster:
    • Verify corosync.conf is identical on all nodes.
    • Check authkey presence and permissions.
    • Confirm network connectivity (ping between ring addresses).
    • Look at journalctl -u corosync for membership or authentication errors.
  2. Frequent membership flapping (nodes joining/leaving):
    • Network instability:
      • Packet loss, congestion, or interface flaps.
    • Token timeout too low:
      • Consider increasing token in totem to allow more time before declaring failure.
    • Misconfigured MTU:
      • Ensure consistent MTU across the cluster networks.
  3. Split-brain scenarios in two-node clusters:
    • Using two_node without a tie-breaker:
      • Consider implementing QDevice.
    • Unreliable inter-node link:
      • Improve network redundancy or quality; consider multiple rings.
  4. Corosync won’t start after config change:
    • Syntax error in corosync.conf:
      • Run Corosync in the foreground temporarily to see the errors:
       sudo systemctl stop corosync
       sudo corosync -f

(Press Ctrl+C to stop, then fix the issue.)

  5. One ring fails but the cluster survives:
    • Check ring status via corosync-cmapctl.
    • Investigate physical interface, VLAN, or routing problems.
    • Ensure both rings are truly independent (ideally separate switches/paths).

Summary

Corosync is the low-level cluster engine providing:

  • cluster membership and failure detection,
  • reliable, ordered messaging between nodes, and
  • quorum information for higher layers such as Pacemaker.

For effective use in high-availability clusters:

  • keep corosync.conf identical on all nodes and roll out changes carefully,
  • protect the shared authkey and distribute it securely,
  • consider multiple rings (or knet links) for network redundancy,
  • prefer a QDevice over bare two_node mode for production two-node clusters, and
  • verify membership and quorum with corosync-quorumtool and corosync-cmapctl before debugging higher layers.
