
Cluster-level parallelism

Scope of Cluster-level Parallelism

Hybrid programs use more than one kind of parallelism at the same time. At the cluster level you combine node-level parallelism inside each compute node with distributed-memory parallelism across many nodes. In practice this almost always means MPI between nodes, and threads or GPU kernels inside each node.

Cluster-level parallelism is about how you use many nodes of an HPC cluster together as one machine. Node-level parallelism is about what happens inside a single node. In this chapter the focus stays outside the node. You will treat each node as a unit that runs one or more MPI processes, and you will reason about how those processes cooperate across the cluster.

MPI as the glue between nodes

Across nodes there is no shared memory. Even in a hybrid program, communication between nodes is still expressed through MPI calls. Threads or GPU kernels must not attempt to read or write memory that lives on another node directly. When data must move between nodes, an MPI process is responsible for moving it.

In a typical hybrid program, MPI processes are mapped to nodes in one of two common styles. You can run one MPI process per node and use many threads or GPUs inside that node, or you can run several MPI processes per node, each with its own threads or its own GPU. At the cluster level, your MPI configuration, including communicator size and rank layout, defines how many nodes participate and who talks to whom.

The MPI world size often reflects total cluster-level parallelism. For example, with 4 nodes and 2 MPI processes per node, you will have 8 MPI ranks. How you use those 8 ranks is a cluster-level design choice, independent of how many threads each rank spawns inside its node.
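As a minimal sketch, each rank can query this layout at run time. The node-local split below assumes an MPI-3 library that supports MPI_Comm_split_type with MPI_COMM_TYPE_SHARED, which groups ranks by shared-memory domain, i.e. by node:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);   /* global rank across the cluster */
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);   /* total cluster-level parallelism */

        /* Split ranks by shared-memory domain, i.e. by node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);

        int node_rank, ranks_per_node;
        MPI_Comm_rank(node_comm, &node_rank);         /* rank within this node */
        MPI_Comm_size(node_comm, &ranks_per_node);    /* MPI processes sharing this node */

        printf("global %d/%d, local %d/%d on this node\n",
               world_rank, world_size, node_rank, ranks_per_node);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }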

Roles of nodes in hybrid programs

Not all nodes have to play equal roles in a hybrid computation. At cluster scale you will sometimes assign special responsibilities to certain nodes. A single master rank can coordinate a set of worker ranks on many nodes. A subset of ranks can act as I/O servers. Some ranks may manage GPUs while others do not.

This differs from node-level parallelism where the threads of one process usually share a symmetric role with similar access to memory and hardware. At cluster level, different nodes may host different parts of the problem, store different subsets of data, or handle different phases of a workflow such as preprocessing, simulation, and postprocessing. Hybrid programming gives you the flexibility to keep these roles explicit at the MPI level, while threads or GPU kernels implement local computation on each node.
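As an illustration of making such roles explicit, the sketch below splits the world communicator by role. The policy of one I/O server per 16 ranks is an arbitrary assumption for the example, not a recommendation:

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Illustrative policy: every 16th rank acts as an I/O server,
           the remaining ranks do computation. */
        int is_io_server = (rank % 16 == 0);

        /* Sub-communicator containing only the ranks with the same role. */
        MPI_Comm role_comm;
        MPI_Comm_split(MPI_COMM_WORLD, is_io_server ? 1 : 0, rank, &role_comm);

        if (is_io_server) {
            /* ... receive data from compute ranks and write it out ... */
        } else {
            /* ... run the local computation with threads or GPU kernels ... */
        }

        MPI_Comm_free(&role_comm);
        MPI_Finalize();
        return 0;
    }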

Decomposing work across nodes

Cluster-level parallelism starts from a global problem and slices it into chunks that can be assigned to nodes. How you do this decomposition at the cluster level strongly shapes performance and scalability.

A common pattern is to first partition the domain or dataset across MPI ranks, placing one rank, or a small number of ranks, on each node. Inside each rank you then further divide its local part among threads or GPU work items. The important point is that the top-level partitioning across MPI ranks decides which subset of data lives on which node. This choice controls how much network communication is needed and which pairs of nodes must talk.

For structured grid problems you may assign each rank a subdomain of the grid. At large node counts, you might use a 2D or 3D process grid at the MPI level, and then use threading only inside each domain piece. For particle or unstructured problems you might partition a graph or a set of particles across ranks. The quality of this cluster-level partition, meaning how balanced it is and how small the boundaries are, often matters more than the finer-grained work-sharing among threads.
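For the structured-grid case, a sketch of the top-level partition might let MPI choose a balanced 3D process grid and tell each rank which subdomain it owns. The non-periodic boundaries and the reorder flag are assumptions for illustration:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Let MPI factor the world size into a balanced 3D process grid. */
        int dims[3] = {0, 0, 0};
        MPI_Dims_create(world_size, 3, dims);

        /* Non-periodic boundaries; allow MPI to reorder ranks to fit the network. */
        int periods[3] = {0, 0, 0};
        MPI_Comm cart_comm;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart_comm);

        /* Each rank learns which subdomain of the global grid it owns. */
        int cart_rank, coords[3];
        MPI_Comm_rank(cart_comm, &cart_rank);
        MPI_Cart_coords(cart_comm, cart_rank, 3, coords);
        printf("rank %d owns subdomain (%d, %d, %d) of a %dx%dx%d process grid\n",
               cart_rank, coords[0], coords[1], coords[2],
               dims[0], dims[1], dims[2]);

        MPI_Comm_free(&cart_comm);
        MPI_Finalize();
        return 0;
    }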

Cluster-level decomposition also includes the time dimension. You might pipeline different stages of a simulation across different node groups. For example, one group of nodes could run an ensemble of simulations, and another group could analyze results. In such workflows, MPI ranks on different nodes coordinate data handoff between stages, while node-level parallelism simply accelerates each stage locally.

Mapping MPI processes to cluster hardware

On a cluster, each node has a certain topology and connection to the network. Cluster-level parallelism must respect both the internal NUMA layout of nodes and the external network topology. Hybrid programming gives you more options, but also more choices to get wrong.

The MPI launcher, often integrated with the job scheduler, decides how ranks are placed onto nodes. Command line options, host files, or scheduler directives specify how many ranks per node, which nodes to use, and possibly how to group ranks onto sockets. At cluster scale you care about more than just the total count. You want communication that is frequent and heavy to occur between ranks that reside on nodes that are close to each other in the cluster interconnect.

For example, in a 3D domain decomposition you may want ranks that are neighbors in the 3D rank grid to be physically close in the network. If your interconnect has a fat-tree or dragonfly topology, the system may try to do this mapping automatically. Hybrid programs can reduce the number of MPI ranks and thereby sometimes reduce pressure on the network, but you still need to think about how those ranks are distributed across the available nodes.

Node-level pinning is important but covered elsewhere. At cluster level you will also care about avoiding an uneven rank distribution. For instance, if you have 10 ranks and 4 nodes, the mapping might be 3, 3, 2, 2 ranks per node. How you choose that distribution can affect memory and network hot spots. Often the best choice is a uniform number of ranks per node, but special patterns appear when some nodes use GPUs and others are CPU-only.

Coordination patterns across nodes

Cluster-level parallelism introduces explicit communication patterns between nodes. Hybrid programs usually follow a few common MPI-level patterns, while using node-level parallel constructs to accelerate the computations between messages.

A simple pattern is domain boundary exchange. Each rank owns a subdomain and interacts with its neighbors via halo or ghost cell updates. At the cluster level, this means nearest neighbor MPI point-to-point communication across the interconnect. Inside each node, threads or GPUs compute new interior values and then pack or unpack boundary data.
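A minimal sketch of this exchange in one dimension, assuming each rank stores its interior points plus one ghost cell at each end, can use MPI_Sendrecv with MPI_PROC_NULL at the outer boundaries so edge ranks need no special case:

    #include <mpi.h>

    #define N_LOCAL 1024   /* interior points per rank; illustrative size */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* u[0] and u[N_LOCAL+1] are ghost cells filled by the neighbors. */
        double u[N_LOCAL + 2] = {0.0};

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send the rightmost owned value to the right neighbor while receiving
           the left ghost cell from the left neighbor, and vice versa. */
        MPI_Sendrecv(&u[N_LOCAL], 1, MPI_DOUBLE, right, 0,
                     &u[0],       1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  1,
                     &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... threads or GPU kernels now update the interior points ... */

        MPI_Finalize();
        return 0;
    }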

Another pattern is global synchronization and reduction. Many algorithms periodically need cluster-wide sums, minima, or maxima. MPI collective operations provide these capabilities. Hybrid programs should minimize how often they perform global collectives at scale. It is often efficient to first combine data within a node, using threads or GPU kernels, and then call a single MPI collective that carries one value per rank instead of many.
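A sketch of this two-level reduction, assuming OpenMP on the node level, first combines the local data with threads and then sends a single value per rank into the collective:

    #include <mpi.h>

    /* Sum an array with threads, then reduce one value per rank across the cluster. */
    double global_sum(const double *data, long n, MPI_Comm comm) {
        double local = 0.0;

        /* Node-level parallelism: threads combine the local data first. */
        #pragma omp parallel for reduction(+:local)
        for (long i = 0; i < n; i++) {
            local += data[i];
        }

        /* Cluster-level parallelism: one MPI collective, one value per rank. */
        double total = 0.0;
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
        return total;
    }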

Work distribution patterns also exist at cluster level. A master node can dynamically assign tasks to worker nodes via MPI. Within each worker, node-level threads or GPU kernels carry out the assigned task. This approach, often called hierarchical tasking, supports dynamic load balancing between nodes without forcing every fine-grained task to travel over the network.
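A sketch of this pattern has rank 0 handing out task indices while workers ask for more until a stop value arrives; the tags and the plain integer task descriptors are illustrative assumptions:

    #include <mpi.h>

    #define TAG_REQUEST 1
    #define TAG_TASK    2
    #define NUM_TASKS   1000   /* illustrative number of coarse-grained tasks */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Master: hand out task indices on request, then send -1 to stop. */
            int next_task = 0, stopped = 0;
            while (stopped < size - 1) {
                int dummy;
                MPI_Status status;
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                         MPI_COMM_WORLD, &status);
                int task = (next_task < NUM_TASKS) ? next_task++ : -1;
                if (task < 0) stopped++;
                MPI_Send(&task, 1, MPI_INT, status.MPI_SOURCE, TAG_TASK,
                         MPI_COMM_WORLD);
            }
        } else {
            /* Worker: request tasks until the master signals there are none left. */
            while (1) {
                int dummy = 0, task;
                MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
                MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (task < 0) break;
                /* ... threads or GPU kernels process this coarse task locally ... */
            }
        }

        MPI_Finalize();
        return 0;
    }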

Balancing load across many nodes

In a hybrid environment you must consider balance on two levels. Node-level load balancing tries to keep all threads or GPUs busy inside each node. Cluster-level load balancing tries to keep all participating nodes busy over time. If some nodes finish their local work early and then wait for slower nodes, the overall speed is controlled by the slowest node.

Cluster-level load imbalance can arise from uneven domain partitioning, variable work per element, or hardware heterogeneity. A naive static mapping of equal numbers of elements to each MPI rank may not produce equal runtime per rank across the cluster. Also, node differences such as different numbers of cores or GPUs can introduce imbalance if all nodes receive equal work.

Hybrid parallelism adds flexibility at this level. You can sometimes assign different numbers of MPI ranks or different thread counts per node to reflect heterogeneous hardware. You can also move more of the dynamic scheduling responsibility to node-level mechanisms. For example, at cluster level you might make relatively coarse decisions about which node handles which set of tasks. Within each node, a thread pool or GPU scheduler can dynamically share the local work. This reduces the frequency of cross-node work redistribution.

For highly irregular problems you may employ global dynamic load balancing where MPI processes on different nodes exchange tasks or migrate data. In a hybrid design, it is common to exchange tasks or chunks that are large enough to amortize network cost, then use threads or GPUs to split these chunks into smaller units locally.

Minimizing network traffic in hybrid designs

Cluster-level parallelism is limited by network cost. Node-level parallelism is usually cheaper in terms of latency and bandwidth. A central design principle of hybrid cluster-level parallelism is to use threads or GPUs to reduce the amount of cross-node communication when possible.

If each node aggregates its internal data before communication, you can send fewer, larger messages. For example, node-level loops can pack halo data from all threads into a single buffer per MPI rank. Similarly, threads can compute local partial reductions, and only a small set of aggregated values cross the network. Even if the total number of floating point operations stays the same, the number of network messages and bytes can drop significantly.

Another practical technique is to overlap computation and communication at cluster scale. An MPI rank can start nonblocking communications, then use node-level parallelism to compute on data that does not depend on the incoming messages. When the communication completes, threads or GPUs can finish boundary regions. This pattern reduces the visible cost of network latency. MPI provides nonblocking operations for both point-to-point and collective communication, and hybrid programs take special advantage of these by having abundant intra-node work to do during waits.
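A sketch of this overlap for the one-dimensional halo layout used earlier: nonblocking receives and sends are posted first, the interior is updated with threads while the messages are in flight, and only the two boundary points wait for the halo. The simple averaging stencil stands in for a real update:

    #include <mpi.h>

    /* Overlap halo exchange with interior updates for a 1D subdomain.
       u has n interior points plus one ghost cell at each end. */
    void timestep(double *u, double *u_new, long n,
                  int left, int right, MPI_Comm comm) {
        MPI_Request reqs[4];

        /* Start the cluster-level communication without waiting for it. */
        MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Irecv(&u[n + 1], 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
        MPI_Isend(&u[n],     1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

        /* Node-level parallelism: interior points do not need the incoming halo. */
        #pragma omp parallel for
        for (long i = 2; i <= n - 1; i++) {
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);
        }

        /* Wait for the halo, then finish the two boundary points. */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        u_new[1] = 0.5 * (u[0] + u[2]);
        u_new[n] = 0.5 * (u[n - 1] + u[n + 1]);
    }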

Hybrid layouts with fewer MPI ranks per node can be beneficial because they reduce the number of communicating endpoints and message headers across the network. However, each rank then has more data and more computation. Threads and GPUs inside each node must be able to exploit that larger per-rank workload efficiently. The best configuration therefore depends on both the network characteristics and the node-level performance of your application.

Interaction with the interconnect

At cluster level the characteristics of the interconnect, such as latency, bandwidth, topology, and support for special features, directly impact performance. Hybrid programs do not change the physical network, but they change how you use it.

High-performance networks such as InfiniBand or vendor-specific interconnects usually support advanced features like remote direct memory access or collective offload. Many MPI implementations exploit these internally. From the perspective of cluster-level parallelism, you should design communication patterns that favor large, regular exchanges, which the interconnect hardware and MPI libraries can optimize.

Hybrid configurations allow you to adjust how many MPI ranks share each network interface. If network injection bandwidth per node is limited, using many ranks per node may cause contention for the network interface. Fewer ranks, each with more computation and threads, can sometimes improve effective per-rank bandwidth. On the other hand, if your network is underutilized while CPUs or GPUs are starved for work, a different mix may be better.

Node internal layouts, such as where the network card is attached in a NUMA system, also interact with cluster-level design. Typically the process that calls MPI should run on the socket closest to the network card. Node-level threads should then be arranged so they do not unduly compete for the same resources used by the MPI rank for communication progress.

Scaling behavior at cluster level

Cluster-level parallelism determines how your program behaves when you increase the number of nodes. Strong and weak scaling analyses are performed at this level. The hybrid nature of the code affects these scaling properties, but the primary scaling drivers are still communication volume, synchronization frequency, and load balance between nodes.

When you increase node count while holding the total global problem size fixed, each node gets less work. Communication per node may not shrink proportionally, so the ratio of computation to communication tends to worsen. Hybrid designs can mitigate this by reducing MPI ranks and messages, but only up to a point. Eventually network latency and collective operations will dominate.

In weak scaling, where you increase both problem size and node count so that work per node stays roughly constant, you want the time per step to stay flat. Cluster-level design choices related to global reductions, all-to-all communication, and data redistribution become critical. If an algorithm has communication patterns whose total cost grows faster than linearly with node count, this will show up clearly in weak scaling curves.

Hybrid programs allow you to reuse node-level tuning as you change the number of nodes. For example, you might fix an efficient thread count per node, and then increase the number of nodes through MPI ranks. Cluster-level scaling studies will then focus on how the cost per step changes as you increase the number of ranks, while thread-level efficiency remains mostly constant.

Practical layout examples on clusters

On a real cluster, you will specify cluster-level parallelism through the job scheduler and the MPI launcher. For instance, you may request 8 nodes, and on each node you might run 4 MPI processes with 8 threads each. At cluster level, your application sees 32 ranks that must cooperate through MPI. At node level, each rank uses 8 threads through OpenMP or another threading model.

If the application is GPU accelerated, you might request 8 nodes with 4 GPUs per node. A common cluster-level layout is one MPI rank per GPU, for a total of 32 ranks. Each rank is tied to a GPU on its node. The nodes then cooperate through MPI, while inside each node threads or CUDA kernels exploit device parallelism.
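A sketch of tying each rank to one device uses the node-local rank as the GPU index; the actual device-selection call depends on the GPU stack, so it appears here only as a comment referring to the CUDA runtime:

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Node-local rank: ranks 0..3 on a node with 4 GPUs each take one device. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        int local_rank;
        MPI_Comm_rank(node_comm, &local_rank);

        /* Select the GPU for this rank, e.g. with the CUDA runtime:
           cudaSetDevice(local_rank);
           All kernels launched by this rank then run on its own device,
           while data between nodes still travels through MPI. */

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }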

In some workflows, a single job may use different cluster-level layouts at different phases. A pre-processing stage may use many nodes with relatively few threads per node to convert or partition data. A main simulation stage may then use a different mix with more threads or GPU use per node. The cluster-level units (MPI ranks and nodes) and node-level units (threads and GPU kernels) can be configured phase by phase to better match the dominant operations.

Designing for reliability and fault awareness

As node count increases, the chance of at least one node experiencing a problem also increases. Cluster-level parallelism therefore raises new reliability concerns. Although full fault tolerance mechanisms belong elsewhere, you should be aware that many techniques, such as checkpointing, are naturally defined at the cluster level.

Hybrid programs often coordinate checkpoints through MPI collectives. Each rank writes its portion of the data, often in parallel to a shared file system. Node-level threads just help move or compress data locally. In some advanced setups, a subset of nodes may act as checkpoint servers or handle redundancy schemes. Cluster-level parallelism therefore interacts with resilience strategies, because failures typically occur at the node level, and MPI communicators must handle the resulting changes or abort.
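A sketch of such a coordinated checkpoint with MPI-IO, assuming each rank writes a fixed-size block of doubles at an offset derived from its rank; the file name and layout are illustrative:

    #include <mpi.h>

    /* Each rank writes its local block into one shared checkpoint file. */
    void write_checkpoint(const double *local, long n, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_File fh;
        MPI_File_open(comm, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Offset is this rank's position in the global array, in bytes. */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);

        /* Collective write: all ranks participate, so the MPI library and the
           parallel file system can coordinate and aggregate the I/O. */
        MPI_File_write_at_all(fh, offset, local, (int)n, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }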

At large scales, soft performance faults, such as a slow node or a misconfigured GPU, can behave like a partial failure. Cluster-level designs that depend on strict synchronization between all nodes will suffer from such stragglers. Hybrid approaches that allow some asynchrony between node groups, or that reduce reliance on cluster-wide barriers, can be more resilient to such variations.

Summary of cluster-level concerns in hybrid codes

Cluster-level parallelism in hybrid programs is about how you use many nodes as a coherent, scalable computing resource. You decide how to split the global problem into node-sized units, how many MPI processes each node runs, and how those processes exchange data over the interconnect. Threads or GPUs then accelerate local computation inside each node.

Effective cluster-level design for hybrid codes focuses on a few key goals. You want to keep all nodes busy, minimize and aggregate network traffic, map MPI ranks reasonably to the cluster topology, and maintain good scaling as you grow to more nodes. The choices you make at this level frame the maximum performance and scalability your hybrid application can reach on modern HPC clusters.
