Scope of cluster-level parallelism
Cluster-level parallelism is about how your application uses multiple nodes in a cluster at the same time. In hybrid programs, node-level parallelism is usually handled by threads (e.g., OpenMP within a node), while cluster-level parallelism is handled by processes (typically MPI) communicating over a network.
In this chapter, the focus is on:
- How work is distributed across nodes.
- How MPI (or similar models) is used at the inter-node level in hybrid codes.
- How cluster interconnect characteristics affect your design.
- Practical patterns and choices specific to the multi-node level.
Details of MPI syntax, OpenMP constructs, and interconnect technologies are covered in their respective chapters; here we concentrate on how they combine at cluster scale.
Processes, nodes, and ranks
On a multi-node system, you typically have:
- Many nodes.
- Each node with multiple sockets/cores (often with hardware threads).
- One or more MPI processes per node, each of which may spawn threads.
Cluster-level parallelism is primarily about MPI processes (or equivalent) and how they are mapped to nodes:
- Global ranks: each MPI process has a unique rank in MPI_COMM_WORLD. Cluster-level decisions often depend on this global layout.
- Node-local ranks: within a node, processes can be indexed 0, 1, 2, … for mapping to sockets/NUMA domains.
- Resource binding: you decide which process runs on which node and which cores it uses, often via job scheduler options (e.g., SLURM's --nodes, --ntasks, --ntasks-per-node, --cpus-per-task).
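To make the terms in the list above concrete, here is a minimal C sketch that queries the global rank, the node-local rank, and the number of ranks sharing a node. It assumes an MPI-3 library, since it relies on MPI_Comm_split_type with MPI_COMM_TYPE_SHARED.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split MPI_COMM_WORLD into one communicator per shared-memory node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);   /* node-local rank: 0, 1, 2, ... */
    MPI_Comm_size(node_comm, &node_size);   /* number of ranks on this node  */

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);

    printf("global rank %d/%d, node-local rank %d/%d on %s\n",
           world_rank, world_size, node_rank, node_size, name);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

The node-local rank is what you would typically use to pin a process to a particular socket or NUMA domain.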
Cluster-level design concerns:
- How many processes to run per node.
- How much memory each process needs.
- How processes will communicate across nodes versus within nodes.
Hybrid decomposition: across nodes vs. within nodes
In a hybrid MPI + threads program, you partition work in two stages:
- Cluster-level (inter-node) decomposition
- Divide the global problem into chunks assigned to MPI processes.
- Each process is usually bound to one node or a subset of a node.
- Node-level (intra-node) decomposition
- Within each process, threads further subdivide that process’s chunk.
- This typically uses shared-memory parallelism.
For cluster-level parallelism, you choose what each MPI process is responsible for:
- Spatial/domain decomposition (e.g., subdomains of a grid, sets of particles, subsets of a mesh).
- Task decomposition (e.g., different phases or components of a workflow; less common for tightly coupled simulations).
- Data decomposition (e.g., matrix blocks, subsets of rows/columns).
Key decisions at cluster level:
- Granularity: how large a chunk each process owns.
- Shape/layout: for structured domains, whether to use 1D, 2D, or 3D decompositions across processes.
- Ownership: which process is responsible for which data; that determines communication patterns.
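For structured domains, the shape/layout and ownership decisions above map naturally onto MPI's Cartesian topology routines. The following sketch uses a 2D example with hypothetical global grid dimensions NX and NY; MPI chooses a balanced process grid and each rank learns its subdomain and its neighbors.

```c
#include <mpi.h>
#include <stdio.h>

#define NX 1024   /* hypothetical global grid size in x */
#define NY 1024   /* hypothetical global grid size in y */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI pick a balanced 2D process grid (e.g., 4x3 for 12 ranks). */
    int dims[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);

    int periods[2] = {0, 0};          /* non-periodic boundaries */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    MPI_Comm_rank(cart, &rank);
    int coords[2];
    MPI_Cart_coords(cart, rank, 2, coords);

    /* Ownership: this rank's block of the global grid (remainders ignored for brevity). */
    int local_nx = NX / dims[0], local_ny = NY / dims[1];

    /* Neighbors determine the communication pattern (halo exchanges). */
    int left, right, down, up;
    MPI_Cart_shift(cart, 0, 1, &left, &right);
    MPI_Cart_shift(cart, 1, 1, &down, &up);

    printf("rank %d at (%d,%d) owns %dx%d cells; x-neighbors %d/%d, y-neighbors %d/%d\n",
           rank, coords[0], coords[1], local_nx, local_ny, left, right, down, up);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```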
Typical process/thread layouts on clusters
On a node with C cores (or hardware threads) you commonly see:
- Pure MPI: C MPI processes per node, 1 core per process.
- Few MPI processes + many threads: P MPI processes per node, T threads per process, with P × T = C.
Cluster-level parallelism is mostly about the MPI part of this configuration.
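The P × T = C split shows up at initialization: each MPI process requests a threading level and then sizes its OpenMP team. A minimal sketch, assuming the funneled model in which only the main thread makes MPI calls:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Request MPI_THREAD_FUNNELED: threads exist, but only the main
       thread calls MPI. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not support the requested threading level\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* T threads per rank, usually set via OMP_NUM_THREADS so that
       P (ranks per node) x T = C (cores per node). */
    int T = omp_get_max_threads();
    if (rank == 0)
        printf("running with %d OpenMP threads per MPI rank\n", T);

    #pragma omp parallel
    {
        /* node-level parallelism: each thread works on part of this
           rank's chunk of the problem */
    }

    MPI_Finalize();
    return 0;
}
```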
Common hybrid layouts across the cluster:
- 1 MPI process per node, many threads
- Cluster-level parallelism: number of nodes.
- Node-level parallelism: threads.
- Advantages:
- Fewer MPI ranks → less communication metadata and fewer messages.
- All data on a node is within a single address space (per rank).
- Disadvantages:
- More pressure on shared resources (memory bandwidth) per rank.
- Potential load-balancing issues if nodes are not equally loaded.
- Several MPI processes per node, moderate threads
- E.g., 2–4 MPI processes per node, each bound to a socket/NUMA domain, each with several threads.
- This often matches the hardware topology (sockets, NUMA regions).
- Can reduce NUMA penalties and improve memory locality.
- Many MPI processes, few threads
- Close to pure MPI, but with small threading regions for select kernels.
- Good for codes already MPI-heavy that gain modestly from threading.
When thinking cluster-level, you choose #nodes × MPI ranks per node based on:
- Total memory required.
- Communication volume and pattern.
- Network characteristics (latency/bandwidth).
- Scheduler constraints (max tasks per node, policies).
Communication patterns at cluster scale
Cluster-level communication is dominated by inter-node messages, which are usually:
- Higher latency than intra-node messages.
- Lower bandwidth than memory transfers.
- A bottleneck if used too frequently or with poor patterns.
For hybrid programs, aim to:
- Confine fine-grained communication to within nodes via threads where possible.
- Aggregate data before sending between nodes to reduce message counts.
Typical cluster-level patterns:
- Nearest-neighbor exchanges (e.g., halo/ghost cell updates in PDE solvers).
- Collectives across many nodes (e.g., MPI_Allreduce, MPI_Bcast).
- Global I/O to parallel filesystems (I/O patterns are covered elsewhere; here note that coordination across nodes is often MPI-based).
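The sketch below combines the two most common patterns in the list above for a 1D decomposition: a nearest-neighbor halo exchange with MPI_Sendrecv followed by a global residual reduction with MPI_Allreduce. The array size N_LOCAL and the assumption of a simple 1D rank ordering are illustrative.

```c
#include <mpi.h>
#include <math.h>

#define N_LOCAL 1000   /* interior cells owned by this rank (illustrative) */

/* u has N_LOCAL + 2 entries: u[0] and u[N_LOCAL+1] are ghost cells. */
void halo_exchange_and_residual(double *u, double *residual, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* 1D decomposition: left/right neighbors, MPI_PROC_NULL at the ends. */
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first interior cell left, receive the right neighbor's into my right ghost. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Send my last interior cell right, receive the left neighbor's into my left ghost. */
    MPI_Sendrecv(&u[N_LOCAL],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);

    /* Global collective across all nodes: sum of local residual contributions. */
    double local = 0.0;
    for (int i = 1; i <= N_LOCAL; i++)
        local += u[i] * u[i];
    MPI_Allreduce(&local, residual, 1, MPI_DOUBLE, MPI_SUM, comm);
    *residual = sqrt(*residual);
}
```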
Cluster-level optimization typically focuses on:
- Reducing the frequency of messages.
- Increasing the message size (fewer, larger messages).
- Aligning communication phases across processes to avoid idle time.
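As an example of the "fewer, larger messages" point, the sketch below aggregates several small per-neighbor fields into one buffer and sends a single message instead of three. The field names, counts, and the function name send_aggregated_halo are illustrative.

```c
#include <mpi.h>
#include <string.h>

/* Instead of three small sends per neighbor, copy the boundary values of
   all fields into one contiguous buffer and send once. */
void send_aggregated_halo(const double *rho, const double *vx, const double *vy,
                          int n, int neighbor, MPI_Comm comm) {
    double buf[3 * 1024];                 /* assumes n <= 1024 (illustrative) */
    memcpy(buf,         rho, n * sizeof(double));
    memcpy(buf + n,     vx,  n * sizeof(double));
    memcpy(buf + 2 * n, vy,  n * sizeof(double));

    /* One larger message instead of three small ones: the per-message
       latency is paid only once. */
    MPI_Send(buf, 3 * n, MPI_DOUBLE, neighbor, 42, comm);
}
```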
Rank mapping and process placement
On a cluster, where your MPI ranks land can significantly affect performance:
- Node-level mapping: which ranks share a node.
- Socket/NUMA mapping: which ranks share a socket.
- Network-aware mapping: aligning compute neighbors with network neighbors if the interconnect supports it (e.g., torus, fat-tree).
At cluster level, considerations include:
- Logical topology vs. physical topology:
- For a 3D domain decomposition, you may try to place ranks in a way that minimizes communication distance between neighbors.
- Communicator subgroups:
- Create communicators for node-local groups (MPI_Comm_split_type with MPI_COMM_TYPE_SHARED) and for higher-level groupings that might map onto network partitions.
- Avoiding oversubscription:
- Ensure processes are not contending for the same cores, especially when adding threads.
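A common way to express the communicator subgroups mentioned above is to derive a node-local communicator and a cross-node "leaders" communicator from MPI_COMM_WORLD, as in this sketch (the names build_hierarchy and leader_comm are illustrative).

```c
#include <mpi.h>

/* Build (a) a communicator of ranks sharing this node and (b) a communicator
   containing one "leader" rank per node, useful for hierarchical operations. */
void build_hierarchy(MPI_Comm *node_comm, MPI_Comm *leader_comm) {
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Node-local group: all ranks that can share memory. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, node_comm);

    int node_rank;
    MPI_Comm_rank(*node_comm, &node_rank);

    /* Leaders: node-local rank 0 of each node joins color 0; all other ranks
       pass MPI_UNDEFINED and receive MPI_COMM_NULL. */
    int color = (node_rank == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, leader_comm);
}
```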
These decisions are typically influenced by:
- Job scheduler settings (e.g., SLURM’s
--distribution,--ntasks-per-node). - MPI runtime options (e.g., mapping/binding flags).
Balancing work across nodes
At cluster scale, load imbalance across nodes can dominate runtime, even if threads are well balanced within each node.
Sources of cluster-level imbalance:
- Non-uniform problem structure (e.g., adaptive meshes, irregular graphs).
- Node heterogeneity (older vs newer nodes, thermal throttling).
- Different I/O or data-access patterns across nodes.
Cluster-focused strategies:
- Static partitioning:
- Use problem-aware decomposition tools (e.g., graph partitioners) to distribute work evenly across MPI ranks.
- Ensure each rank gets a roughly equal share of both computation and communication.
- Dynamic or semi-dynamic partitioning:
- Use rebalancing steps that redistribute work across MPI processes at runtime.
- Keep frequent, lightweight rebalancing within nodes (via threads), and perform less frequent, heavier rebalancing across nodes (via MPI) to reduce communication overhead.
Hybrid-specific consideration:
- You can keep the number of MPI processes per node fixed and adjust the distribution of data or tasks among those processes to handle large-scale rebalancing across the cluster.
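One simple way to decide when such cross-cluster rebalancing is worthwhile is to compare each rank's measured compute time against the global average. A sketch under that assumption follows; the function name should_rebalance and the 5% threshold are illustrative choices.

```c
#include <mpi.h>

/* Returns 1 if the slowest rank is more than ~5% above the average compute
   time, suggesting a (relatively expensive) cross-node rebalance may pay off. */
int should_rebalance(double my_compute_time, MPI_Comm comm) {
    int size;
    MPI_Comm_size(comm, &size);

    double max_t, sum_t;
    MPI_Allreduce(&my_compute_time, &max_t, 1, MPI_DOUBLE, MPI_MAX, comm);
    MPI_Allreduce(&my_compute_time, &sum_t, 1, MPI_DOUBLE, MPI_SUM, comm);

    double avg_t = sum_t / size;
    return max_t > 1.05 * avg_t;    /* illustrative imbalance threshold */
}
```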
Scaling behavior across multiple nodes
When you increase the number of nodes, performance is affected in ways specific to cluster-level parallelism:
- Strong scaling limits:
- Eventually, per-node (per-rank) work becomes so small that inter-node communication dominates.
- Hybrid codes can push the strong-scaling limit somewhat further by reducing the MPI process count and using more threads per process.
- Weak scaling challenges:
- Even if work per node is constant, global operations (e.g., reductions) get more expensive as node count grows.
- Communication overhead in collectives is a major cluster-level concern.
Cluster-level strategies to improve scaling:
- Use hierarchical collectives:
- First perform reductions within each node (using threads), then across nodes (using MPI), possibly mimicking or leveraging hierarchical algorithms in MPI.
- Reduce global synchronization points:
- Avoid unnecessary global barriers that force all nodes to wait.
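A hierarchical reduction along the lines described above can be built from the node-local and leader communicators introduced in the rank-mapping section: reduce within each node first, combine across nodes over the much smaller leader communicator, then broadcast the result back within each node. This is a sketch of the idea; many MPI libraries already apply similar hierarchical algorithms internally.

```c
#include <mpi.h>

/* Hierarchical allreduce: node-local reduce -> inter-node allreduce among
   leaders -> node-local broadcast. node_comm/leader_comm as built earlier. */
double hierarchical_allreduce(double local_value,
                              MPI_Comm node_comm, MPI_Comm leader_comm) {
    double node_sum = 0.0, global_sum = 0.0;

    /* Step 1: combine contributions of ranks on the same node
       (cheap, shared-memory transport). */
    MPI_Reduce(&local_value, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Step 2: only node leaders communicate across the interconnect. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, leader_comm);

    /* Step 3: each leader shares the result with the rest of its node. */
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);
    return global_sum;
}
```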
Hybrid-aware use of cluster interconnects
Cluster interconnect properties (latency, bandwidth, topology, offload capabilities) strongly affect how you design cluster-level parallelism:
- Intra-node vs inter-node separation:
- Use shared-memory parallelism (threads) for fine-grained operations.
- Reserve the interconnect for coarser, aggregated messages.
- Overlapping communication and computation:
- Let MPI processes initiate non-blocking inter-node communication.
- Use threads to perform useful computation while waiting for data.
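The sketch below shows this overlap structure for one step of a halo-based computation: the main thread posts non-blocking receives and sends, threads update the interior cells (which need no remote data), and only then does the code wait for the halos and finish the boundary cells. The kernels compute_interior and compute_boundary are placeholders for the real stencil update.

```c
#include <mpi.h>

/* Placeholder kernels: a real code would do the actual stencil update. */
static void compute_interior(double *u, int n) {
    #pragma omp parallel for          /* threads cover cells with no remote deps */
    for (int i = 2; i < n; i++)
        u[i] *= 0.5;
}

static void compute_boundary(double *u, int n) {
    u[1] = 0.5 * (u[0] + u[2]);         /* uses the halo received from the left  */
    u[n] = 0.5 * (u[n - 1] + u[n + 1]); /* uses the halo received from the right */
}

/* u has n+2 entries: u[0] and u[n+1] are ghost cells. */
void timestep_with_overlap(double *u, int n, int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];

    /* Main thread posts non-blocking halo communication (funneled model). */
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(&u[n + 1], 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(&u[n],     1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    /* Overlap: interior work proceeds while messages are in flight. */
    compute_interior(u, n);

    /* Wait for halos, then finish the cells that needed neighbor data. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(u, n);
}
```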
On some systems, libraries and MPI implementations can exploit:
- Shared-memory transports for ranks on the same node.
- RDMA/offload capabilities for inter-node messages.
Cluster-level parallelism should be organized so that:
- Communication calls are placed in locations where there is enough independent work to overlap.
- Only a subset of threads (often one per MPI process) interacts directly with the interconnect, avoiding contention.
Practical cluster-level job configuration
When launching hybrid jobs across a cluster, typical user decisions include:
- #nodes: how many nodes to request from the scheduler.
- #MPI ranks per node: how many processes to run per node.
- #threads per rank: how many threads per process, matching the cores available to each rank.
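As a small sanity check of the last two choices, each rank can compare its OpenMP team size against the cores the scheduler allocated to it. The sketch below reads omp_get_max_threads() and SLURM's SLURM_CPUS_PER_TASK output variable; whether that variable is set depends on how the job was submitted (e.g., whether --cpus-per-task was given).

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threads this rank will actually use (OMP_NUM_THREADS or runtime default). */
    int threads = omp_get_max_threads();

    /* Cores the scheduler allocated per task; SLURM exports this when
       --cpus-per-task is given (may be unset otherwise). */
    const char *cpt = getenv("SLURM_CPUS_PER_TASK");
    int cores = cpt ? atoi(cpt) : -1;

    if (rank == 0) {
        printf("threads per rank: %d, cores per task: %d\n", threads, cores);
        if (cores > 0 && threads != cores)
            printf("warning: thread count does not match allocated cores\n");
    }

    MPI_Finalize();
    return 0;
}
```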
Typical configuration reasoning:
- Memory-limited problems:
- You might increase the number of nodes primarily to increase available memory, adjusting #ranks × #threads to use cores efficiently.
- Communication-limited problems:
- Prefer fewer MPI ranks (larger subdomains per rank), more threading, fewer inter-node neighbors.
- Compute-limited problems:
- Maximize utilization of all cores across all nodes, while still minimizing communication overhead.
Cluster-level parallelism is measured and tuned by:
- Varying the number of nodes while keeping work fixed (strong scaling tests).
- Inspecting communication time vs computation time in performance profiles.
- Adjusting the process/thread layout to reduce cross-node traffic.
Summary of cluster-level focus in hybrid codes
For hybrid parallel programs, cluster-level parallelism:
- Uses processes (typically MPI) to scale across nodes.
- Manages how problem data and work are partitioned over the cluster.
- Determines the primary inter-node communication patterns and overheads.
- Interacts with hardware topology and job scheduling policies.
- Must be designed in tandem with node-level threading to achieve good performance and scalability on large systems.