
9.4 Cluster-level parallelism

Scope of cluster-level parallelism

Cluster-level parallelism is about how your application uses multiple nodes in a cluster at the same time. In hybrid programs, node-level parallelism is usually handled by threads (e.g., OpenMP within a node), while cluster-level parallelism is handled by processes (typically MPI) communicating over a network.

In this chapter, the focus is on:

Details of MPI syntax, OpenMP constructs, and interconnect technologies are covered in their respective chapters; here we concentrate on how they combine at cluster scale.

Processes, nodes, and ranks

On a multi-node system, you typically have:

Cluster-level parallelism is primarily about MPI processes (or equivalent) and how they are mapped to nodes:

Cluster-level design concerns:
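As a minimal, hedged illustration of these ideas (not taken from a specific course example), the C sketch below has every MPI rank report its rank, the total number of ranks, the node it runs on, and how many OpenMP threads it may use. MPI_THREAD_FUNNELED is requested because only the main thread will make MPI calls.

  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int provided, rank, size, name_len;
      char node_name[MPI_MAX_PROCESSOR_NAME];

      /* FUNNELED: threads exist, but only the main thread calls MPI. */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

      MPI_Comm_rank(MPI_COMM_WORLD, &rank);          /* this process's cluster-wide identity */
      MPI_Comm_size(MPI_COMM_WORLD, &size);          /* total number of MPI processes */
      MPI_Get_processor_name(node_name, &name_len);  /* which node this rank landed on */

      printf("rank %d of %d on %s, up to %d OpenMP threads\n",
             rank, size, node_name, omp_get_max_threads());

      MPI_Finalize();
      return 0;
  }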

Hybrid decomposition: across nodes vs. within nodes

In a hybrid MPI + threads program, you partition work in two stages (a minimal sketch follows this list):

  1. Cluster-level (inter-node) decomposition
    • Divide the global problem into chunks assigned to MPI processes.
    • Each process is usually bound to one node or a subset of a node.
  2. Node-level (intra-node) decomposition
    • Within each process, threads further subdivide that process’s chunk.
    • This typically uses shared-memory parallelism.
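A minimal sketch of this two-stage split, assuming a global 1-D index range of length N and an illustrative per-element function process(i) (neither is defined in this chapter): each MPI rank takes one contiguous block, and OpenMP threads subdivide that block.

  #include <mpi.h>

  void process(long i);   /* illustrative per-element work */

  void two_stage(long N) {
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Stage 1 (cluster level): block-distribute the global range over ranks,
         spreading the remainder over the first N % size ranks. */
      long base = N / size, rem = N % size;
      long lo = rank * base + (rank < rem ? rank : rem);
      long hi = lo + base + (rank < rem ? 1 : 0);

      /* Stage 2 (node level): threads subdivide this rank's block. */
      #pragma omp parallel for schedule(static)
      for (long i = lo; i < hi; i++)
          process(i);
  }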

For cluster-level parallelism, you choose what each MPI process is responsible for:

Key decisions at cluster level:

Typical process/thread layouts on clusters

On a node with C cores (or hardware threads) you commonly see:

Cluster-level parallelism is mostly about the MPI part of this configuration.

Common hybrid layouts across the cluster (a sketch for checking the layout at run time follows this list):

  1. 1 MPI process per node, many threads
    • Cluster-level parallelism: number of nodes.
    • Node-level parallelism: threads.
    • Advantages:
      • Fewer MPI ranks → less communication metadata and fewer messages.
      • All of the node’s data lives in a single address space, since there is only one rank per node.
    • Disadvantages:
      • More pressure on shared resources (memory bandwidth) per rank.
      • Potential load-balancing issues if nodes are not equally loaded.
  2. Several MPI processes per node, moderate threads
    • E.g., 2–4 MPI processes per node, each bound to a socket/NUMA domain, each with several threads.
    • This often matches the hardware topology (sockets, NUMA regions).
    • Can reduce NUMA penalties and improve memory locality.
  3. Many MPI processes, few threads
    • Close to pure MPI, but with small threading regions for select kernels.
    • Good for codes already MPI-heavy that gain modestly from threading.

When thinking cluster-level, you choose #nodes × MPI ranks per node based on:

Communication patterns at cluster scale

Cluster-level communication is dominated by inter-node messages, which are usually:

For hybrid programs, aim to:

Typical cluster-level patterns:

Cluster-level optimization typically focuses on:
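One concrete instance of overlapping cluster-level communication with node-level computation is a nonblocking halo exchange. The sketch below (a 1-D example with illustrative names, assuming left and right are neighbor ranks or MPI_PROC_NULL at the domain ends) posts the boundary messages first, lets OpenMP threads update the interior while the messages are in flight, and finishes the boundary points after MPI_Waitall.

  #include <mpi.h>

  /* One relaxation step on a local 1-D block u[0..n-1], result written to u_new. */
  void halo_step(const double *u, double *u_new, long n,
                 int left, int right, MPI_Comm comm) {
      /* Defaults are used if a neighbor is MPI_PROC_NULL (domain boundary). */
      double halo_lo = u[0], halo_hi = u[n - 1];
      MPI_Request req[4];

      /* Post the inter-node boundary traffic first. */
      MPI_Irecv(&halo_lo, 1, MPI_DOUBLE, left,  0, comm, &req[0]);
      MPI_Irecv(&halo_hi, 1, MPI_DOUBLE, right, 1, comm, &req[1]);
      MPI_Isend(&u[0],     1, MPI_DOUBLE, left,  1, comm, &req[2]);
      MPI_Isend(&u[n - 1], 1, MPI_DOUBLE, right, 0, comm, &req[3]);

      /* Overlap: interior points need no halo data, so threads can work immediately. */
      #pragma omp parallel for
      for (long i = 1; i < n - 1; i++)
          u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

      /* Wait for the halos, then finish the two boundary points. */
      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
      u_new[0]     = 0.5 * (halo_lo + u[1]);
      u_new[n - 1] = 0.5 * (u[n - 2] + halo_hi);
  }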

Rank mapping and process placement

On a cluster, where your MPI ranks land can significantly affect performance:

At cluster level, considerations include:

These decisions are typically influenced by:
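Placement-aware code often mirrors the node structure explicitly. A hedged sketch of one such pattern, a two-level sum (standard MPI-3 calls, illustrative function name): reduce within each node first, combine across nodes between one leader rank per node, then broadcast back inside each node.

  #include <mpi.h>

  /* Two-level allreduce of a single double: intra-node first, then inter-node. */
  double node_aware_allreduce(double local, MPI_Comm comm) {
      MPI_Comm node_comm, leader_comm;
      int node_rank;
      double node_sum = 0.0, global_sum = 0.0;

      /* Ranks sharing a node form one communicator. */
      MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
      MPI_Comm_rank(node_comm, &node_rank);

      /* One leader per node (node_rank == 0) joins the inter-node communicator. */
      MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

      /* Step 1: sum within the node onto the leader. */
      MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

      /* Step 2: leaders combine their node sums across the cluster. */
      if (leader_comm != MPI_COMM_NULL) {
          MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, leader_comm);
          MPI_Comm_free(&leader_comm);
      }

      /* Step 3: the leader shares the result with the rest of its node. */
      MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);
      MPI_Comm_free(&node_comm);
      return global_sum;
  }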

Balancing work across nodes

At cluster scale, load imbalance across nodes can dominate runtime, even if threads are well balanced within each node.
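To tell whether this is happening, a common first step is to time each rank’s share of a step and compare the slowest rank with the average. A minimal sketch (compute_step is a placeholder name, not defined in this chapter):

  #include <mpi.h>
  #include <stdio.h>

  void compute_step(void);   /* placeholder for this rank's per-step work */

  void timed_step(void) {
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      double t0 = MPI_Wtime();
      compute_step();
      double t = MPI_Wtime() - t0;

      double t_max, t_sum;
      MPI_Allreduce(&t, &t_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
      MPI_Allreduce(&t, &t_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      if (rank == 0) {
          /* The step takes as long as the slowest rank; max/avg > 1 indicates imbalance. */
          printf("step time: max %.3fs, avg %.3fs, imbalance %.2f\n",
                 t_max, t_sum / size, t_max / (t_sum / size));
      }
  }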

Sources of cluster-level imbalance:

Cluster-focused strategies:

Hybrid-specific consideration:

Scaling behavior across multiple nodes

When you increase the number of nodes, performance is affected in ways specific to cluster-level parallelism:
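A standard way to quantify these effects (textbook definitions, not specific to this chapter) is strong-scaling speedup and parallel efficiency:

  S(P) = T(1) / T(P)        E(P) = S(P) / P

where T(P) is the time to solve the fixed problem on P nodes. Efficiency that falls well below 1 as nodes are added usually points to growing communication cost or to load imbalance across nodes.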

Cluster-level strategies to improve scaling:

Hybrid-aware use of cluster interconnects

Cluster interconnect properties (latency, bandwidth, topology, offload capabilities) strongly affect how you design cluster-level parallelism:

On some systems, libraries and MPI implementations can exploit:

Cluster-level parallelism should be organized so that:

Practical cluster-level job configuration

When launching hybrid jobs across a cluster, typical user decisions include:

Typical configuration reasoning:

Cluster-level parallelism is measured and tuned by:

Summary of cluster-level focus in hybrid codes

For hybrid parallel programs, cluster-level parallelism:
