Role of Corosync in a Cluster Stack
Corosync is the messaging and membership layer commonly used in Linux HA clusters, especially together with Pacemaker. In the overall stack:
- Corosync provides:
  - Cluster membership (who is in/out of the cluster).
  - Quorum calculation.
  - Reliable message broadcasting between nodes.
  - A configuration database (confdb).
- Pacemaker (covered in the parent chapter) uses Corosync as its communication and membership provider.
You normally do not run Pacemaker without Corosync in a classic “Corosync + Pacemaker” stack; they are tightly integrated but are separate projects and daemons.
Corosync Architecture Overview
Key architectural concepts specific to Corosync:
- Nodes: Each physical/virtual machine participating in the cluster; identified by a `nodeid` and usually a `name`.
- Rings:
  - Logical communication channels over which Corosync sends messages.
  - Usually:
    - `ring0` – primary network (e.g., main LAN).
    - `ring1` – optional redundant network (e.g., second NIC / VLAN).
  - Each ring is bound to an IP and network interface.
- Totem protocol:
  - Corosync's internal group communication protocol.
  - Implements a reliable ordered broadcast using a logical token ring.
- Services / APIs:
  - `cpg` (Closed Process Group): message groups used by Pacemaker and other apps.
  - `confdb`: configuration and runtime information service.
  - `quorum`: quorum information and notifications.
  - `votequorum`: advanced quorum/voting, used for two-node and more complex setups.
- Transport:
  - `udpu`: unicast UDP.
  - `udp`: (older) multicast UDP, often not used in modern configurations due to network restrictions.
In practice you will mostly interact with the configuration file, daemons, and corosync-* tools; the internal APIs are used by cluster-aware applications (like Pacemaker) rather than directly by admins.
Installing and Enabling Corosync
Installation is distribution-specific but follows the same pattern.
Examples:
- RHEL / CentOS / Rocky / Alma:

  ```
  sudo dnf install corosync corosync-qdevice
  sudo systemctl enable --now corosync
  ```

- Debian / Ubuntu:

  ```
  sudo apt install corosync corosync-qdevice
  sudo systemctl enable --now corosync
  ```

- SUSE / openSUSE:

  ```
  sudo zypper install corosync corosync-qdevice
  sudo systemctl enable --now corosync
  ```

On many "HA cluster" stacks, additional packages (such as Pacemaker) are installed at the same time; this chapter focuses on the Corosync side only.
Corosync Configuration File (`corosync.conf`)
The main configuration file is usually:
/etc/corosync/corosync.conf
It is cluster-wide: every node in the same cluster should have an identical `corosync.conf`. The `nodelist` contains one entry per node (including that node's `ringX_addr` lines); each node finds its own entry in the shared list, so the file itself does not differ between nodes.
A minimal modern configuration using `udpu` might look like:

```
totem {
    version: 2
    secauth: on
    cluster_name: mycluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: node1.example.com
        nodeid: 1
    }
    node {
        ring0_addr: node2.example.com
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_syslog: yes
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    timestamp: on
}
```

You will usually copy and adapt such a file across all nodes.
`totem` Section
The totem block configures the messaging protocol and cluster basics. Common parameters:
- `version`: Protocol version; for current Corosync 2.x clusters this is usually `2`.
- `cluster_name`:
  - Human-readable name for this cluster.
  - Should be unique in your environment; used in logging and identification.
- `transport`:
  - `udpu`: unicast UDP, recommended on most modern networks.
  - `udp`: multicast UDP, used only if your network supports multicast and you need it.
- `secauth`:
  - `on`/`off`: enables message authentication and encryption.
  - When `on`, Corosync uses a shared key to secure communication.
  - In production, this should always be `on`.
- `token`:
  - Token timeout in milliseconds (e.g., `token: 3000`).
  - How long a node waits before declaring the token lost and starting a membership change.
  - Lower values → faster failure detection but more sensitivity to transient network hiccups.
- `token_retransmits_before_loss_const`:
  - Number of retransmits before concluding the token is lost.
- `join`, `consensus`, `max_messages`, etc.:
  - Fine-tuning parameters; defaults are acceptable for most environments and tuning is an advanced topic.
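As an illustrative sketch (the timeout values here are assumptions for a slightly lossy network, not recommendations), a `totem` block with explicit token tuning might look like:

```
totem {
    version: 2
    secauth: on
    cluster_name: mycluster
    transport: udpu
    # Wait 5 s before declaring the token lost; illustrative value only -
    # tune against your own network's observed latency and loss.
    token: 5000
    # Retransmit the token up to 10 times before treating it as lost.
    token_retransmits_before_loss_const: 10
}
```

Raising `token` trades slower failure detection for fewer spurious membership changes on flaky links.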
Multiple Rings in `totem`
To use more than one network path, you can configure multiple rings, e.g.:
```
totem {
    version: 2
    secauth: on
    cluster_name: mycluster
    transport: udpu

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.10.0
        mcastport: 5405
    }

    interface {
        ringnumber: 1
        bindnetaddr: 10.10.10.0
        mcastport: 5407
    }
}
```

- `ringnumber`: Logical number of the ring.
- `bindnetaddr`: Network to bind to; for `udpu` this is still used to pick the interface.
- `mcastport` is more relevant for multicast setups; with `udpu` you'll commonly rely primarily on `ringX_addr` per node in `nodelist`.
Corosync can automatically fail over to another ring if one network path fails, improving cluster robustness.
`nodelist` Section
The nodelist defines all nodes in the cluster:
```
nodelist {
    node {
        name: node1
        ring0_addr: 192.168.10.11
        ring1_addr: 10.10.10.11
        nodeid: 1
    }
    node {
        name: node2
        ring0_addr: 192.168.10.12
        ring1_addr: 10.10.10.12
        nodeid: 2
    }
}
```

Key elements:

- `nodeid`:
  - Unique integer per node, used internally.
  - Must not change once the cluster is in production unless you rebuild the cluster.
- `ring0_addr`, `ring1_addr`, etc.:
  - IP or hostname on each ring for that node.
  - Every node's `corosync.conf` must have the same list, with the same IDs and addresses.
- `name`:
  - Optional; helps with readability and is used by some tools.
Corosync does not use automatic discovery in typical Pacemaker setups; you list all nodes explicitly in `nodelist`.
`quorum` Section
Quorum is the mechanism used to ensure that only a subset of nodes (with a majority or valid vote) can run cluster resources.
Quorum on Corosync is typically handled by the votequorum service:
```
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
```

Key parameters:

- `provider`:
  - For modern clusters, use `corosync_votequorum`.
- `expected_votes`:
  - Total number of votes in the cluster.
  - Often equals the number of nodes but can differ if you adjust per-node votes in the `nodelist`.
  - If not set, Corosync can calculate it from the node list.
- `two_node`:
  - Special handling for 2-node clusters (set to `1` to enable).
  - Primarily for simple two-node environments; production HA often uses a quorum device instead (see below).
You will typically not do detailed quorum policy here; that’s handled by Pacemaker, which uses the quorum information Corosync provides.
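Quorum follows simple majority arithmetic: a partition is quorate when it holds strictly more than half of `expected_votes`. A minimal sketch of that calculation, using plain shell arithmetic (no Corosync required; the function name is illustrative):

```shell
#!/bin/sh
# Majority threshold is floor(expected_votes / 2) + 1.
# Prints whether a partition holding "have" votes would be quorate.
quorate() {
  expected="$1"; have="$2"
  threshold=$(( expected / 2 + 1 ))
  if [ "$have" -ge "$threshold" ]; then
    echo "quorate"
  else
    echo "not quorate"
  fi
}

quorate 3 2   # two of three votes: prints "quorate"
quorate 2 1   # one of two votes: prints "not quorate" (hence two_node / QDevice)
```

The second case is exactly the two-node problem discussed later: with `expected_votes: 2`, a lone surviving node never reaches the threshold of 2 without special handling.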
`logging` Section
Corosync logging is configured via the logging block:
```
logging {
    fileline: off
    to_syslog: yes
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    timestamp: on
    debug: off
}
```

Common parameters:

- `to_syslog`: Log messages through syslog/journald.
- `to_logfile`: Log to a specific file.
- `logfile`: Path to the log file.
- `debug`: When `on`, increases verbosity; useful for troubleshooting but noisy.
- `timestamp`: Include timestamps in file logs.
- `fileline`: When `on`, adds source file/line information (useful during development or in-depth debugging).
Authentication and Cluster Keys
With `secauth: on`, Corosync uses a shared key for message authentication and encryption. This key is stored in:

```
/etc/corosync/authkey
```

The file must:

- Exist and be identical on all nodes.
- Have strict permissions, usually `600` and owned by `root`.

To generate the key:

```
sudo corosync-keygen
```

This will:

- Create `/etc/corosync/authkey`.
- Fill it with random data suitable for Corosync authentication.

Distribute the generated key securely to all nodes, e.g. using `scp` over SSH:

```
sudo scp /etc/corosync/authkey root@node2:/etc/corosync/authkey
```

Then ensure permissions are correct on all nodes:

```
sudo chown root:root /etc/corosync/authkey
sudo chmod 600 /etc/corosync/authkey
```

Do not edit `authkey` manually; always regenerate when you need a new key.
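A small sanity check for the key file's mode can catch distribution mistakes before Corosync complains. This is a hypothetical helper (the function name is an assumption, and it checks only the mode, not ownership, so it can be exercised on any file):

```shell
#!/bin/sh
# check_authkey_mode: succeed only if the given key file exists with mode 600.
check_authkey_mode() {
  key="$1"
  [ -f "$key" ] || { echo "missing: $key"; return 1; }
  mode=$(stat -c '%a' "$key")   # GNU stat; prints e.g. "600"
  if [ "$mode" = "600" ]; then
    echo "ok: $key"
  else
    echo "insecure mode $mode on $key (expected 600)"
    return 1
  fi
}

# usage: check_authkey_mode /etc/corosync/authkey
```

Run it on each node after copying the key; a non-zero exit means the key is missing or readable by non-root users.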
Starting, Stopping, and Status
Corosync is managed through systemd on most distributions.
Common operations:
```
# Start Corosync
sudo systemctl start corosync

# Enable at boot
sudo systemctl enable corosync

# Check current status
systemctl status corosync

# Stop Corosync
sudo systemctl stop corosync

# Restart (after config changes)
sudo systemctl restart corosync
```

When running with Pacemaker, cluster management tools may expect both services to be running; always ensure that changes to Corosync are coordinated with the rest of the cluster.
Inspecting Cluster Membership and Quorum
Corosync provides several CLI tools to inspect membership and quorum. These are particularly useful to verify that Corosync itself is healthy before investigating resource-manager issues.
`corosync-cmapctl`
Displays key-value pairs in Corosync’s configuration and runtime database (cmap / confdb):
```
# Show all keys and values
corosync-cmapctl

# Filter by category, e.g., runtime membership
corosync-cmapctl | grep runtime.members
```

Helpful key prefixes (exact key names vary between Corosync versions):

- `runtime.members.<nodeid>.*` – per-member runtime information (e.g., address and join status).
- `runtime.connections.*` – active IPC connections to the daemon.
Use this tool when you need low-level detail about what Corosync believes about the cluster state.
`corosync-quorumtool`
Shows quorum status and membership:
```
corosync-quorumtool
```

Typical output:

```
Quorum information
------------------
Date:             Fri Dec 12 10:22:34 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/12345
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate

Membership information
----------------------
    Nodeid      Name
         1      node1 (local)
         2      node2
```

Key aspects:

- `Quorate: Yes/No` – whether the cluster currently has quorum.
- `Nodes` and the `Membership information` section – which nodes are in the cluster from Corosync's perspective.
- `Expected votes`, `Total votes`, and `Quorum` – how the majority is calculated.

You can also list only the membership:

```
corosync-quorumtool -l
```

This prints just the node list, convenient for quick checks or scripts.
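For scripting, the human-readable output can be reduced to just the `Quorate` field. A sketch using `awk` (demonstrated here against a canned sample; in practice you would pipe live `corosync-quorumtool` output into the function):

```shell
#!/bin/sh
# Extract the value of the "Quorate:" line from corosync-quorumtool output.
quorate_flag() {
  awk -F':[ \t]*' '/^Quorate:/ { print $2 }'
}

# Demo with a two-line canned sample instead of a live cluster:
printf 'Ring ID:          1/12345\nQuorate:          Yes\n' | quorate_flag   # prints: Yes
```

A monitoring check could then alert whenever the function prints anything other than `Yes`.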
Two-Node Clusters and `two_node` vs QDevice
Two-node clusters are common but tricky, because:
- With two nodes, losing one means you lose strict majority (1 of 2 is not > 50%).
- Without extra measures, you can face split-brain risks.
Corosync provides two main concepts for this:
- `two_node` mode in `votequorum` (simple / lab setups):
  - Set `two_node: 1` and `expected_votes: 2`.
  - The remaining single node can still be quorate after its peer fails.
  - Does not protect you from network partitions where both nodes think the other is gone.
  - Suitable mainly for simple or non-critical setups.
- QDevice (Quorum Device) (recommended for production two-node clusters):
  - A separate "arbitrator" node or service (qnetd) that provides an additional vote.
  - Implemented via the `corosync-qdevice` daemon and a qnetd server.
  - Typical scenario:
    - Node1, Node2, and a QDevice (e.g., a small VM elsewhere).
    - The cluster uses 3 votes.
    - Any partition without at least 2 votes (e.g., one node alone) becomes non-quorate.

To use QDevice, you would:

- Set up a qnetd server (outside of your two main nodes).
- Configure `corosync-qdevice` on each node.
- Adjust the `quorum` section as needed for QDevice votes.
Deep QDevice configuration is beyond this chapter; the key point is that Corosync has built-in support to handle quorum in small and asymmetric clusters more safely than two_node alone.
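As an illustrative sketch only (the hostname is an assumption, and the option set shown is minimal; consult the corosync-qdevice documentation for the full syntax), the `quorum` section on both cluster nodes would gain a `device` sub-block along these lines:

```
quorum {
    provider: corosync_votequorum

    # The QDevice contributes the tie-breaking vote, so two_node is not needed.
    device {
        model: net
        net {
            # Address of the external qnetd arbitrator (assumed hostname).
            host: qnetd.example.com
            # ffsplit: tie-breaking algorithm intended for 50/50 splits.
            algorithm: ffsplit
        }
    }
}
```

After restarting Corosync and starting `corosync-qdevice`, `corosync-quorumtool` should show the additional vote.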
Using Multiple Rings for Redundancy
Multiple rings let Corosync continue working even if one network path fails.
Typical configuration steps:
- Provide separate interfaces and IP ranges, e.g.:
  - `ring0` on `ens33` with `192.168.10.x`.
  - `ring1` on `ens34` with `10.10.10.x`.
- Configure `nodelist` with both addresses:

  ```
  nodelist {
      node {
          name: node1
          ring0_addr: 192.168.10.11
          ring1_addr: 10.10.10.11
          nodeid: 1
      }
      node {
          name: node2
          ring0_addr: 192.168.10.12
          ring1_addr: 10.10.10.12
          nodeid: 2
      }
  }
  ```

- Ensure `totem` interfaces match these networks.

Corosync will:

- Prefer `ring0` when it's healthy.
- Automatically switch to `ring1` for messages if `ring0` fails.
- Attempt to recover the primary ring once it comes back.
You can see per-ring fault status with `corosync-cfgtool -s`, or inspect ring-related keys via `corosync-cmapctl` (e.g., keys under `runtime.totem.pg.mrp.srp` on Corosync 2.x).
Common Operational Tasks
Below are tasks specifically tied to Corosync management and troubleshooting, not Pacemaker resources.
Rolling Corosync Configuration Changes
When updating `corosync.conf`:

- Edit the file on one node.
- Distribute the identical file to all other nodes:

  ```
  sudo scp /etc/corosync/corosync.conf node2:/etc/corosync/
  ```

- Ask Corosync to reload the configuration:

  ```
  sudo corosync-cfgtool -R
  ```

  (`-R` triggers a runtime reload; not all settings are reloadable, and configuration errors show up in the logs rather than on the command line.)

- Where a restart is required, restart Corosync one node at a time, ensuring the cluster remains quorate and functional between restarts:

  ```
  sudo systemctl restart corosync
  ```

- Verify cluster membership after each restart:

  ```
  corosync-quorumtool
  ```

Some settings (like `transport` or ring definitions) require a full restart. Keep cluster impact in mind.
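The distribute/restart/verify loop can be sketched as a small script. The node names are assumptions, and the runner is parameterized so that passing `echo` gives a dry run that only prints the commands it would execute:

```shell
#!/bin/sh
# Rolling corosync.conf update sketch; node names are illustrative.
# First argument is a runner command: "echo" for a dry run, "" to execute.
rolling_update() {
  run="$1"; shift
  for node in "$@"; do
    $run scp /etc/corosync/corosync.conf "root@$node:/etc/corosync/"
    $run ssh "root@$node" systemctl restart corosync
    # Check quorum from the freshly restarted node's point of view.
    $run ssh "root@$node" corosync-quorumtool
  done
}

# Dry run: print what would be executed against node2 and node3.
rolling_update echo node2 node3
```

In a real run you would also pause between nodes until `corosync-quorumtool` confirms the restarted node has rejoined and the cluster is quorate.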
Checking Logs
Corosync logs are crucial when diagnosing membership problems:
- With `to_syslog: yes`, logs appear in journald:

  ```
  journalctl -u corosync
  ```

- With `to_logfile: yes`, check the configured file:

  ```
  sudo less /var/log/corosync/corosync.log
  ```

Common message patterns:

- Node joins / leaves:
  - "configured nodeid", "joined the cluster", "left the cluster".
- Token timeout:
  - Indications of link failure or a too-low token timeout.
- Authentication issues:
  - Complaints about an `authkey` mismatch or missing file.
Use timestamps and Ring IDs to correlate with observed resource failovers or Pacemaker events.
Typical Corosync Problems and How to Approach Them
A few frequent Corosync-specific issues and troubleshooting angles:
- Node not joining the cluster:
  - Verify `corosync.conf` is identical on all nodes.
  - Check `authkey` presence and permissions.
  - Confirm network connectivity (ping between ring addresses).
  - Look at `journalctl -u corosync` for membership or authentication errors.
- Frequent membership flapping (nodes joining/leaving):
  - Network instability: packet loss, congestion, or interface flaps.
  - Token timeout too low: consider increasing `token` in `totem` to allow more time before declaring failure.
  - Misconfigured MTU: ensure consistent MTU across the cluster networks.
- Split-brain scenarios in two-node clusters:
  - Using `two_node` without a tie-breaker: consider implementing QDevice.
  - Unreliable inter-node link: improve network redundancy or quality; consider multiple rings.
- Corosync won't start after a config change:
  - Likely a syntax error in `corosync.conf`.
  - Run Corosync in the foreground temporarily to see errors:

    ```
    sudo systemctl stop corosync
    sudo corosync -f
    ```

    (Press Ctrl+C to stop, then fix the issue.)

- One ring fails but the cluster survives:
  - Check ring status via `corosync-cfgtool -s` or `corosync-cmapctl`.
  - Investigate physical interface, VLAN, or routing problems.
  - Ensure both rings are truly independent (ideally separate switches/paths).
Summary
Corosync is the low-level cluster engine providing:
- Reliable group messaging.
- Membership and quorum information.
- Secure, authenticated communication between cluster nodes.
For effective use in high-availability clusters:
- Design and test your `corosync.conf` carefully (rings, nodes, quorum).
- Protect communication with `secauth` and a properly managed `authkey`.
- Use the provided tools (`corosync-quorumtool`, `corosync-cmapctl`, logs) to understand and debug cluster state.
- For two-node clusters, prefer QDevice over simple `two_node` mode in serious environments.
- Combine Corosync with Pacemaker (and other higher-level tools) to manage cluster resources, relying on Corosync for the underlying membership and quorum guarantees.