Scope of cluster-level parallelism
Cluster-level parallelism is about how your application uses multiple nodes in a cluster at the same time. In hybrid programs, node-level parallelism is usually handled by threads (e.g., OpenMP within a node), while cluster-level parallelism is handled by processes (typically MPI) communicating over a network.
In this chapter, the focus is on:
- How work is distributed across nodes.
- How MPI (or similar models) is used at the inter-node level in hybrid codes.
- How cluster interconnect characteristics affect your design.
- Practical patterns and choices specific to the multi-node level.
Details of MPI syntax, OpenMP constructs, and interconnect technologies are covered in their respective chapters; here we concentrate on how they combine at cluster scale.
Processes, nodes, and ranks
On a multi-node system, you typically have:
- Many nodes.
- Each node with multiple sockets/cores (often with hardware threads).
- One or more MPI processes per node, each of which may spawn threads.
Cluster-level parallelism is primarily about MPI processes (or equivalent) and how they are mapped to nodes:
- Global ranks: each MPI process has a unique rank in MPI_COMM_WORLD. Cluster-level decisions often depend on this global layout.
- Node-local ranks: within a node, processes can be indexed 0, 1, 2, … for mapping to sockets/NUMA domains.
- Resource binding: you decide which process runs on which node and which cores it uses, often via job scheduler options (e.g., SLURM's --nodes, --ntasks, --ntasks-per-node, --cpus-per-task).
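To make the terms in the list above concrete, here is a minimal C sketch that queries the global rank, the node-local rank, and the number of ranks sharing a node. It assumes an MPI-3 library, since it relies on MPI_Comm_split_type with MPI_COMM_TYPE_SHARED.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split MPI_COMM_WORLD into one communicator per shared-memory node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);   /* node-local rank: 0, 1, 2, ... */
    MPI_Comm_size(node_comm, &node_size);   /* number of ranks on this node  */

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);

    printf("global rank %d/%d, node-local rank %d/%d on %s\n",
           world_rank, world_size, node_rank, node_size, name);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

The node-local rank is what you would typically use to pin a process to a particular socket or NUMA domain.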
Cluster-level design concerns:
- How many processes to run per node.
- How much memory each process needs.
- How processes will communicate across nodes versus within nodes.
Hybrid decomposition: across nodes vs. within nodes
In a hybrid MPI + threads program, you partition work in two stages:
- Cluster-level (inter-node) decomposition
- Divide the global problem into chunks assigned to MPI processes.
- Each process is usually bound to one node or a subset of a node.
- Node-level (intra-node) decomposition
- Within each process, threads further subdivide that process’s chunk.
- This typically uses shared-memory parallelism.
For cluster-level parallelism, you choose what each MPI process is responsible for:
- Spatial/domain decomposition (e.g., subdomains of a grid, sets of particles, subsets of a mesh).
- Task decomposition (e.g., different phases or components of a workflow; less common for tightly coupled simulations).
- Data decomposition (e.g., matrix blocks, subsets of rows/columns).
Key decisions at cluster level:
- Granularity: how large a chunk each process owns.
- Shape/layout: for structured domains, whether to use 1D, 2D, or 3D decompositions across processes.
- Ownership: which process is responsible for which data; that determines communication patterns.
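For structured domains, the shape/layout and ownership decisions above map naturally onto MPI's Cartesian topology routines. The following sketch uses a 2D example with hypothetical global grid dimensions NX and NY; MPI chooses a balanced process grid and each rank learns its subdomain and its neighbors.

```c
#include <mpi.h>
#include <stdio.h>

#define NX 1024   /* hypothetical global grid size in x */
#define NY 1024   /* hypothetical global grid size in y */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI pick a balanced 2D process grid (e.g., 4x3 for 12 ranks). */
    int dims[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);

    int periods[2] = {0, 0};          /* non-periodic boundaries */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    MPI_Comm_rank(cart, &rank);
    int coords[2];
    MPI_Cart_coords(cart, rank, 2, coords);

    /* Ownership: this rank's block of the global grid (remainders ignored for brevity). */
    int local_nx = NX / dims[0], local_ny = NY / dims[1];

    /* Neighbors determine the communication pattern (halo exchanges). */
    int left, right, down, up;
    MPI_Cart_shift(cart, 0, 1, &left, &right);
    MPI_Cart_shift(cart, 1, 1, &down, &up);

    printf("rank %d at (%d,%d) owns %dx%d cells; x-neighbors %d/%d, y-neighbors %d/%d\n",
           rank, coords[0], coords[1], local_nx, local_ny, left, right, down, up);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```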
Typical process/thread layouts on clusters
On a node with C cores (or hardware threads) you commonly see:
- Pure MPI: C MPI processes per node, 1 core per process.
- Few MPI processes + many threads: P MPI processes per node, T threads per process, with P × T = C.
Cluster-level parallelism is mostly about the MPI part of this configuration.
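The P × T = C split shows up at initialization: each MPI process requests a threading level and then sizes its OpenMP team. A minimal sketch, assuming the funneled model in which only the main thread makes MPI calls:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Request MPI_THREAD_FUNNELED: threads exist, but only the main
       thread calls MPI. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not support the requested threading level\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* T threads per rank, usually set via OMP_NUM_THREADS so that
       P (ranks per node) x T = C (cores per node). */
    int T = omp_get_max_threads();
    if (rank == 0)
        printf("running with %d OpenMP threads per MPI rank\n", T);

    #pragma omp parallel
    {
        /* node-level parallelism: each thread works on part of this
           rank's chunk of the problem */
    }

    MPI_Finalize();
    return 0;
}
```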
Common hybrid layouts across the cluster:
- 1 MPI process per node, many threads
- Cluster-level parallelism: number of nodes.
- Node-level parallelism: threads.
- Advantages:
- Fewer MPI ranks → less communication metadata and fewer messages.
- All data on a node is within a single address space (per rank).
- Disadvantages:
- More pressure on shared resources (memory bandwidth) per rank.
- Potential load-balancing issues if nodes are not equally loaded.
- Several MPI processes per node, moderate threads
- E.g., 2–4 MPI processes per node, each bound to a socket/NUMA domain, each with several threads.
- This often matches the hardware topology (sockets, NUMA regions).
- Can reduce NUMA penalties and improve memory locality.
- Many MPI processes, few threads
- Close to pure MPI, but with small threading regions for select kernels.
- Good for codes already MPI-heavy that gain modestly from threading.
When thinking cluster-level, you choose #nodes × MPI ranks per node based on:
- Total memory required.
- Communication volume and pattern.
- Network characteristics (latency/bandwidth).
- Scheduler constraints (max tasks per node, policies).
Communication patterns at cluster scale
Cluster-level communication is dominated by inter-node messages, which are usually:
- Higher latency than intra-node messages.
- Lower bandwidth than memory transfers.
- A bottleneck if used too frequently or with poor patterns.
For hybrid programs, aim to:
- Confine fine-grained communication to within nodes via threads where possible.
- Aggregate data before sending between nodes to reduce message counts.
Typical cluster-level patterns:
- Nearest-neighbor exchanges (e.g., halo/ghost cell updates in PDE solvers).
- Collectives across many nodes (e.g., MPI_Allreduce, MPI_Bcast).
- Global I/O to parallel filesystems (I/O patterns are covered elsewhere; here note that coordination across nodes is often MPI-based).
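The sketch below combines the two most common patterns in the list above for a 1D decomposition: a nearest-neighbor halo exchange with MPI_Sendrecv followed by a global residual reduction with MPI_Allreduce. The array size N_LOCAL and the assumption of a simple 1D rank ordering are illustrative.

```c
#include <mpi.h>
#include <math.h>

#define N_LOCAL 1000   /* interior cells owned by this rank (illustrative) */

/* u has N_LOCAL + 2 entries: u[0] and u[N_LOCAL+1] are ghost cells. */
void halo_exchange_and_residual(double *u, double *residual, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* 1D decomposition: left/right neighbors, MPI_PROC_NULL at the ends. */
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first interior cell left, receive the right neighbor's into my right ghost. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Send my last interior cell right, receive the left neighbor's into my left ghost. */
    MPI_Sendrecv(&u[N_LOCAL],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);

    /* Global collective across all nodes: sum of local residual contributions. */
    double local = 0.0;
    for (int i = 1; i <= N_LOCAL; i++)
        local += u[i] * u[i];
    MPI_Allreduce(&local, residual, 1, MPI_DOUBLE, MPI_SUM, comm);
    *residual = sqrt(*residual);
}
```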
Cluster-level optimization typically focuses on:
- Reducing the frequency of messages.
- Increasing the message size (fewer, larger messages).
- Aligning communication phases across processes to avoid idle time.
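As an example of the "fewer, larger messages" point, the sketch below aggregates several small per-neighbor fields into one buffer and sends a single message instead of three. The field names, counts, and the function name send_aggregated_halo are illustrative.

```c
#include <mpi.h>
#include <string.h>

/* Instead of three small sends per neighbor, copy the boundary values of
   all fields into one contiguous buffer and send once. */
void send_aggregated_halo(const double *rho, const double *vx, const double *vy,
                          int n, int neighbor, MPI_Comm comm) {
    double buf[3 * 1024];                 /* assumes n <= 1024 (illustrative) */
    memcpy(buf,         rho, n * sizeof(double));
    memcpy(buf + n,     vx,  n * sizeof(double));
    memcpy(buf + 2 * n, vy,  n * sizeof(double));

    /* One larger message instead of three small ones: the per-message
       latency is paid only once. */
    MPI_Send(buf, 3 * n, MPI_DOUBLE, neighbor, 42, comm);
}
```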
Rank mapping and process placement
On a cluster, where your MPI ranks land can significantly affect performance:
- Node-level mapping: which ranks share a node.
- Socket/NUMA mapping: which ranks share a socket.
- Network-aware mapping: aligning compute neighbors with network neighbors if the interconnect supports it (e.g., torus, fat-tree).
At cluster level, considerations include:
- Logical topology vs. physical topology:
- For a 3D domain decomposition, you may try to place ranks in a way that minimizes communication distance between neighbors.
- Communicator subgroups:
- Create communicators for node-local groups (MPI_Comm_split_type with MPI_COMM_TYPE_SHARED) and for higher-level groupings that might map onto network partitions.
- Avoiding oversubscription:
- Ensure processes are not contending for the same cores, especially when adding threads.
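A common way to express the communicator subgroups mentioned above is to derive a node-local communicator and a cross-node "leaders" communicator from MPI_COMM_WORLD, as in this sketch (the names build_hierarchy and leader_comm are illustrative).

```c
#include <mpi.h>

/* Build (a) a communicator of ranks sharing this node and (b) a communicator
   containing one "leader" rank per node, useful for hierarchical operations. */
void build_hierarchy(MPI_Comm *node_comm, MPI_Comm *leader_comm) {
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Node-local group: all ranks that can share memory. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, node_comm);

    int node_rank;
    MPI_Comm_rank(*node_comm, &node_rank);

    /* Leaders: node-local rank 0 of each node joins color 0; all other ranks
       pass MPI_UNDEFINED and receive MPI_COMM_NULL. */
    int color = (node_rank == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, leader_comm);
}
```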
These decisions are typically influenced by:
- Job scheduler settings (e.g., SLURM’s
--distribution,--ntasks-per-node). - MPI runtime options (e.g., mapping/binding flags).
Balancing work across nodes
At cluster scale, load imbalance across nodes can dominate runtime, even if threads are well balanced within each node.
Sources of cluster-level imbalance:
- Non-uniform problem structure (e.g., adaptive meshes, irregular graphs).
- Node heterogeneity (older vs newer nodes, thermal throttling).
- Different I/O or data-access patterns across nodes.
Cluster-focused strategies:
- Static partitioning:
- Use problem-aware decomposition tools (e.g., graph partitioners) to distribute work evenly across MPI ranks.
- Ensure each rank gets a roughly equal share of both computation and communication.
- Dynamic or semi-dynamic partitioning:
- Use rebalancing steps that redistribute work across MPI processes at runtime.
- Keep frequent, lightweight rebalancing within nodes (via threads), and perform less frequent, heavier rebalancing across nodes (via MPI) to reduce communication overhead.
Hybrid-specific consideration:
- You can keep the number of MPI processes per node fixed and adjust the distribution of data or tasks among those processes to handle large-scale rebalancing across the cluster.
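One simple way to decide when such cross-cluster rebalancing is worthwhile is to compare each rank's measured compute time against the global average. A sketch under that assumption follows; the function name should_rebalance and the 5% threshold are illustrative choices.

```c
#include <mpi.h>

/* Returns 1 if the slowest rank is more than ~5% above the average compute
   time, suggesting a (relatively expensive) cross-node rebalance may pay off. */
int should_rebalance(double my_compute_time, MPI_Comm comm) {
    int size;
    MPI_Comm_size(comm, &size);

    double max_t, sum_t;
    MPI_Allreduce(&my_compute_time, &max_t, 1, MPI_DOUBLE, MPI_MAX, comm);
    MPI_Allreduce(&my_compute_time, &sum_t, 1, MPI_DOUBLE, MPI_SUM, comm);

    double avg_t = sum_t / size;
    return max_t > 1.05 * avg_t;    /* illustrative imbalance threshold */
}
```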
Scaling behavior across multiple nodes
When you increase the number of nodes, performance is affected in ways specific to cluster-level parallelism:
- Strong scaling limits:
- Eventually, per-node (per-rank) work becomes so small that inter-node communication dominates.
- Hybrid codes can push the strong-scaling limit somewhat further by reducing the MPI process count and using more threads per process.
- Weak scaling challenges:
- Even if work per node is constant, global operations (e.g., reductions) get more expensive as node count grows.
- Communication overhead in collectives is a major cluster-level concern.
Cluster-level strategies to improve scaling:
- Use hierarchical collectives:
- First perform reductions within each node (using threads), then across nodes (using MPI), possibly mimicking or leveraging hierarchical algorithms in MPI.
- Reduce global synchronization points:
- Avoid unnecessary global barriers that force all nodes to wait.
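A hierarchical reduction along the lines described above can be built from the node-local and leader communicators introduced in the rank-mapping section: reduce within each node first, combine across nodes over the much smaller leader communicator, then broadcast the result back within each node. This is a sketch of the idea; many MPI libraries already apply similar hierarchical algorithms internally.

```c
#include <mpi.h>

/* Hierarchical allreduce: node-local reduce -> inter-node allreduce among
   leaders -> node-local broadcast. node_comm/leader_comm as built earlier. */
double hierarchical_allreduce(double local_value,
                              MPI_Comm node_comm, MPI_Comm leader_comm) {
    double node_sum = 0.0, global_sum = 0.0;

    /* Step 1: combine contributions of ranks on the same node
       (cheap, shared-memory transport). */
    MPI_Reduce(&local_value, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Step 2: only node leaders communicate across the interconnect. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, leader_comm);

    /* Step 3: each leader shares the result with the rest of its node. */
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);
    return global_sum;
}
```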
Hybrid-aware use of cluster interconnects
Cluster interconnect properties (latency, bandwidth, topology, offload capabilities) strongly affect how you design cluster-level parallelism:
- Intra-node vs inter-node separation:
- Use shared-memory parallelism (threads) for fine-grained operations.
- Reserve the interconnect for coarser, aggregated messages.
- Overlapping communication and computation:
- Let MPI processes initiate non-blocking inter-node communication.
- Use threads to perform useful computation while waiting for data.
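The sketch below shows this overlap structure for one step of a halo-based computation: the main thread posts non-blocking receives and sends, threads update the interior cells (which need no remote data), and only then does the code wait for the halos and finish the boundary cells. The kernels compute_interior and compute_boundary are placeholders for the real stencil update.

```c
#include <mpi.h>

/* Placeholder kernels: a real code would do the actual stencil update. */
static void compute_interior(double *u, int n) {
    #pragma omp parallel for          /* threads cover cells with no remote deps */
    for (int i = 2; i < n; i++)
        u[i] *= 0.5;
}

static void compute_boundary(double *u, int n) {
    u[1] = 0.5 * (u[0] + u[2]);         /* uses the halo received from the left  */
    u[n] = 0.5 * (u[n - 1] + u[n + 1]); /* uses the halo received from the right */
}

/* u has n+2 entries: u[0] and u[n+1] are ghost cells. */
void timestep_with_overlap(double *u, int n, int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];

    /* Main thread posts non-blocking halo communication (funneled model). */
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(&u[n + 1], 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(&u[n],     1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    /* Overlap: interior work proceeds while messages are in flight. */
    compute_interior(u, n);

    /* Wait for halos, then finish the cells that needed neighbor data. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(u, n);
}
```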
On some systems, libraries and MPI implementations can exploit:
- Shared-memory transports for ranks on the same node.
- RDMA/offload capabilities for inter-node messages.
Cluster-level parallelism should be organized so that:
- Communication calls are placed in locations where there is enough independent work to overlap.
- Only a subset of threads (often one per MPI process) interacts directly with the interconnect, avoiding contention.
Practical cluster-level job configuration
When launching hybrid jobs across a cluster, typical user decisions include:
- #nodes: how many nodes to request from the scheduler.
- #MPI ranks per node: how many processes to run per node.
- #threads per rank: how many threads per process, matching the cores available to each rank.
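As a small sanity check of the last two choices, each rank can compare its OpenMP team size against the cores the scheduler allocated to it. The sketch below reads omp_get_max_threads() and SLURM's SLURM_CPUS_PER_TASK output variable; whether that variable is set depends on how the job was submitted (e.g., whether --cpus-per-task was given).

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threads this rank will actually use (OMP_NUM_THREADS or runtime default). */
    int threads = omp_get_max_threads();

    /* Cores the scheduler allocated per task; SLURM exports this when
       --cpus-per-task is given (may be unset otherwise). */
    const char *cpt = getenv("SLURM_CPUS_PER_TASK");
    int cores = cpt ? atoi(cpt) : -1;

    if (rank == 0) {
        printf("threads per rank: %d, cores per task: %d\n", threads, cores);
        if (cores > 0 && threads != cores)
            printf("warning: thread count does not match allocated cores\n");
    }

    MPI_Finalize();
    return 0;
}
```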
Typical configuration reasoning:
- Memory-limited problems:
- You might increase the number of nodes primarily to increase available memory, adjusting #ranks × #threads to use cores efficiently.
- Communication-limited problems:
- Prefer fewer MPI ranks (larger subdomains per rank), more threading, fewer inter-node neighbors.
- Compute-limited problems:
- Maximize utilization of all cores across all nodes, while still minimizing communication overhead.
Cluster-level parallelism is measured and tuned by:
- Varying the number of nodes while keeping work fixed (strong scaling tests).
- Inspecting communication time vs computation time in performance profiles.
- Adjusting the process/thread layout to reduce cross-node traffic.
Summary of cluster-level focus in hybrid codes
For hybrid parallel programs, cluster-level parallelism:
- Uses processes (typically MPI) to scale across nodes.
- Manages how problem data and work are partitioned over the cluster.
- Determines the primary inter-node communication patterns and overheads.
- Interacts with hardware topology and job scheduling policies.
- Must be designed in tandem with node-level threading to achieve good performance and scalability on large systems.