
Node-level parallelism

Understanding Node-Level Parallelism in Hybrid HPC Codes

Node-level parallelism refers to how you exploit all the resources within a single compute node (CPU sockets, cores, hardware threads, and sometimes accelerators) before or alongside scaling out across multiple nodes with MPI. In a hybrid MPI+OpenMP (or MPI+threads) program, this is where OpenMP (or another threading model) typically operates.

This chapter focuses on how to use a single node effectively: how many ranks and threads to run, where to place them, and how to lay out data for locality. The higher-level questions of why hybrid programming is worthwhile and how MPI and OpenMP work together are covered in other chapters.

Hardware Resources Inside a Node

A typical HPC node contains one or more CPU sockets, each with many cores (often exposing two or more hardware threads per core), memory organized into NUMA domains, and sometimes accelerators such as GPUs.

From a node-level parallelism viewpoint, the main levers you control are the number of MPI ranks per node, the number of threads per rank, where those ranks and threads are placed (affinity), and how data is laid out across the NUMA domains.

Forms of Node-Level Parallelism

Within a node you can combine several forms of parallelism: multiple MPI ranks sharing the node, loop-level worksharing among the threads of each rank, task parallelism for irregular work, and, where present, offload to accelerators.

Effective node-level design generally aims to keep every core busy without oversubscribing it, keep each thread working on data local to its NUMA domain, and balance load among the threads of each rank.

MPI Ranks vs Threads on a Node

In a hybrid program you must decide how many MPI ranks (R) to place on each node and how many threads (T) each rank should run, subject to:

$$
R \times T \approx \text{number of usable hardware threads per node}
$$

Common patterns are one rank per node with many threads, one rank per socket or NUMA domain with a thread per core of that domain, or many ranks per node with only a few threads each.

The “best” configuration is highly application- and architecture-dependent and often requires benchmarking.
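
As a quick sanity check at startup, something like the following sketch (assuming MPI-3 and OpenMP; none of this is from the original chapter) counts the ranks sharing a node via MPI_Comm_split_type and compares R × T against the hardware threads the runtime can see. Note that omp_get_num_procs() only reports the processors inside the calling process's affinity mask, so run it with the binding you intend to use.

/* Sketch: report R (ranks on this node), T (threads per rank), and the
 * hardware threads visible to this process, so R x T can be checked. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    /* Communicator containing only the ranks that share this node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int ranks_per_node, node_rank;
    MPI_Comm_size(node_comm, &ranks_per_node);      /* R */
    MPI_Comm_rank(node_comm, &node_rank);

    int threads_per_rank = omp_get_max_threads();   /* T */
    int hw_threads       = omp_get_num_procs();     /* visible to this rank */

    if (node_rank == 0)
        printf("R = %d, T = %d, R*T = %d, hardware threads seen by rank 0 = %d\n",
               ranks_per_node, threads_per_rank,
               ranks_per_node * threads_per_rank, hw_threads);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}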

Work Decomposition Inside a Node

Once the MPI layer distributes work across nodes, node-level parallelism decides how work inside each rank is divided among threads.

A typical pattern is loop-level worksharing, where a parallel loop splits iterations among the rank's threads:

  #pragma omp parallel for
  for (int i = 0; i < N; ++i) {
      // work on local portion of data
  }

Work is evenly or dynamically split among that rank's threads. For irregular work, a task-based pattern is common:

  #pragma omp parallel
  #pragma omp single
  {
      // one thread creates the tasks ...
      for (int t = 0; t < NTASKS; ++t) {
          #pragma omp task
          do_task(t);   // ... and any idle thread in the team executes them
      }
  }

Threads pull tasks from a shared queue, balancing load within the node.

Design questions at node level: Should loops use static or dynamic scheduling? Is loop-level worksharing enough, or does irregular work call for tasks? Does each thread mostly touch data that lives in its own NUMA domain? Is there enough work per thread to amortize the parallelization overhead?

Core Binding and Affinity

A central issue in node-level parallelism is where your threads and ranks run physically.

Reasons to control affinity:
• keep each thread close to the memory it uses (NUMA locality),
• prevent the OS from migrating threads between cores, which discards warm caches,
• avoid several threads ending up on the same core while other cores sit idle,
• make performance results reproducible from run to run.

Common tools and concepts:
• the CPU-binding options of the launcher (e.g. srun --cpu-bind, mpirun --bind-to),
• the OpenMP affinity controls OMP_PROC_BIND and OMP_PLACES,
• numactl for CPU and memory placement,
• hwloc (lstopo) for inspecting the node topology.

Example: one MPI rank per socket, 8 cores per socket:

srun --ntasks-per-node=2 --cpus-per-task=8 \
     --hint=nomultithread \
     ./my_hybrid_app

and inside the application:

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores

This tries to keep each rank’s threads tightly packed on its local cores.
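
To verify that the binding actually took effect, a small check along these lines can print the CPU each thread of each rank is running on (a sketch using the Linux-specific sched_getcpu(); build with an MPI compiler wrapper and OpenMP enabled, e.g. -fopenmp):

/* Sketch: print which CPU each OpenMP thread of each MPI rank runs on. */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* sched_getcpu() returns the CPU the calling thread is on right now. */
        printf("rank %d, thread %d -> cpu %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}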

NUMA-Aware Node-Level Parallelism

On NUMA nodes, memory access time depends on which socket’s memory is accessed by which core. Poor node-level design can double effective memory latency.

Key ideas:
• with the default first-touch policy, a memory page is placed in the NUMA domain of the thread that first writes to it,
• therefore initialize data with the same threads, and the same loop schedule, that will later compute on it, for example:

  #pragma omp parallel for
  for (long i = 0; i < N; ++i) {
      a[i] = 0.0;  // first-touch initialization
  }

so each thread “owns” and touches the data it will later use.
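
A slightly fuller sketch of the idea (compute() is a hypothetical per-element function; the matching static schedules are the important part): initialize and compute with the same schedule, so each thread first-touches exactly the pages it will later work on.

#include <stdlib.h>

double compute(long i) { return (double)i; }   /* hypothetical per-element work */

void init_and_compute(long N) {
    double *a = malloc(N * sizeof *a);

    /* First-touch initialization with the same static schedule as the
     * compute loop: thread k writes the same chunk in both phases, so
     * the pages end up in thread k's NUMA domain. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = 0.0;

    /* Compute phase: identical schedule, identical thread-to-data mapping. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = compute(i);

    free(a);
}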

Oversubscription and Hardware Threads

Node-level parallelism should match hardware capabilities: running more software threads than hardware threads (oversubscription) causes context switching and cache thrashing, while leaving cores idle wastes the node. Whether the extra hardware threads of SMT help or hurt depends on the application.

Recommended practice for compute-intensive hybrid codes:
• start with one thread per physical core (e.g. --hint=nomultithread),
• set OMP_NUM_THREADS explicitly so that R × T does not exceed the cores you were allocated,
• watch out for libraries that spawn their own threads on top of yours,
• enable SMT only if benchmarks show a benefit; a simple runtime check is sketched below.
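
A rank can at least warn when its own team is larger than the processors it was given. The following is a minimal sketch using only standard OpenMP queries; the function name is illustrative.

/* Sketch: warn if this rank's OpenMP team exceeds the processors
 * available to the process (a simple per-rank oversubscription check). */
#include <omp.h>
#include <stdio.h>

void check_thread_count(void) {
    int t  = omp_get_max_threads();   /* threads this rank will use */
    int hw = omp_get_num_procs();     /* processors in this rank's affinity mask */

    if (t > hw)
        fprintf(stderr,
                "warning: %d threads requested but only %d processors "
                "available to this rank (oversubscription)\n", t, hw);
}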

Balancing Work Within a Node

Even if load is balanced across MPI ranks, it can be imbalanced across threads within a node.

Node-level strategies:
• use dynamic or guided loop scheduling instead of static when iteration costs vary,
• use OpenMP tasks so idle threads can pick up remaining work,
• choose chunk sizes large enough to amortize scheduling overhead but small enough to even out the load.

Example:

#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < Nblocks; ++i) {
    process_block(i);
}

This allows faster threads to pick up extra blocks, balancing workload within the node.
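
To see how well the load is actually balanced, each thread's busy time can be measured separately. The sketch below reuses the chapter's process_block placeholder and relies on the nowait clause so the timing does not include the wait at the loop's implicit barrier.

/* Sketch: per-thread timing of a dynamically scheduled loop to expose
 * load imbalance within one rank. */
#include <omp.h>
#include <stdio.h>

void process_block(int i);   /* the chapter's placeholder, defined elsewhere */

void timed_loop(int Nblocks) {
    #pragma omp parallel
    {
        double t0 = omp_get_wtime();

        /* nowait: each thread records its time as soon as its share is done,
         * instead of waiting for the slowest thread at the barrier. */
        #pragma omp for schedule(dynamic, 4) nowait
        for (int i = 0; i < Nblocks; ++i)
            process_block(i);

        double t1 = omp_get_wtime();
        printf("thread %d busy for %.3f s\n", omp_get_thread_num(), t1 - t0);
    }
}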

Node-Level Parallel I/O Considerations

When multiple ranks and threads on a node perform I/O, they compete for the same file-system bandwidth, and many small, uncoordinated writes are usually slower than a few large ones. Common remedies are to aggregate data and funnel output through one thread per rank (or one rank per node), or to use collective MPI-IO or higher-level libraries such as HDF5 for large shared files; a funneling sketch follows below.
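
One way to implement the funneling pattern is sketched below; the function name, file name, and placeholder work are purely illustrative.

/* Sketch: funnel a rank's file output through its master thread so the
 * rank's threads never interleave writes to the same file. */
#include <omp.h>
#include <stdio.h>

void write_rank_results(int rank, double *results, int n) {
    char filename[64];
    snprintf(filename, sizeof filename, "results_rank%04d.txt", rank);

    #pragma omp parallel
    {
        /* Each thread fills its share of results[] (placeholder work). */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            results[i] = (double)i;

        /* The implicit barrier of the loop guarantees results[] is complete;
         * only the master thread then touches the file. */
        #pragma omp master
        {
            FILE *f = fopen(filename, "w");
            if (f) {
                for (int i = 0; i < n; ++i)
                    fprintf(f, "%g\n", results[i]);
                fclose(f);
            }
        }
    }
}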

Typical Node-Level Configurations in Practice

Some representative setups:

  1. Memory-bandwidth bound stencil code on a dual-socket node with 2×16 cores:
    • 2 ranks per node (1 per socket)
    • 16 OpenMP threads per rank
    • Bind ranks and threads to their socket
    • NUMA-aware data layout and first-touch
  2. Communication-heavy MPI application with moderate compute per rank:
    • Many smaller MPI ranks per node (e.g. 8 or 16), each with 2–4 threads
    • Keep per-rank memory footprint small
    • Use node-level threading mainly for inner loops, not for all work
  3. Irregular workload with tasks:
    • Few MPI ranks per node (e.g. 1–2)
    • Many OpenMP threads using task-based parallelism
    • Node-level dynamic load balancing handled by the runtime

Practical Tips for Node-Level Tuning

Useful rules of thumb:
• benchmark a few R × T combinations instead of guessing,
• always set rank and thread binding explicitly,
• initialize data with first-touch in the same pattern as the compute loops,
• do not oversubscribe cores, and enable SMT only when it measurably helps,
• use dynamic scheduling or tasks when the work per iteration varies.

Node-level parallelism is where much of the performance tuning of hybrid codes happens: the hardware is fixed, but how you map ranks, threads, and data onto that hardware can change performance by large factors.
