Understanding Node-Level Parallelism in Hybrid HPC Codes
Node-level parallelism refers to how you exploit all the resources within a single compute node (CPU sockets, cores, hardware threads, and sometimes accelerators) before or alongside scaling out across multiple nodes with MPI. In a hybrid MPI+OpenMP (or MPI+threads) program, this is where OpenMP (or another threading model) typically operates.
This chapter focuses on:
- The hardware resources inside a node that you can parallelize over
- How MPI ranks and threads share the node
- Typical strategies for organizing work inside a node
- Practical considerations: core pinning, memory locality, oversubscription
Higher-level “why hybrid?” and “how MPI+OpenMP work together” questions are covered in other chapters; here the emphasis is on using a single node effectively.
Hardware Resources Inside a Node
A typical HPC node contains:
- One or more sockets (packages), each with a CPU
- Each CPU containing cores
- Each core supporting one or more hardware threads (e.g. Intel Hyper‑Threading)
- A hierarchy of caches and memory controllers
- Often NUMA (Non‑Uniform Memory Access) regions, where each socket has local memory
From a node-level parallelism viewpoint, the main levers you control are:
- How many MPI processes you launch per node
- How many threads each process uses
- How those processes and threads are bound to cores and NUMA regions
Forms of Node-Level Parallelism
Within a node you can combine several forms of parallelism:
- Multi-process (MPI ranks on the same node)
  Multiple ranks on one node, communicating via shared memory or MPI’s shared-memory transport.
- Multi-threading (e.g. OpenMP, pthreads, TBB)
  Multiple threads within a single process sharing memory and data structures.
- SIMD / vectorization
  Compilers use vector instructions within each core. This is orthogonal to threads and MPI and is usually considered “in-core” parallelism.
- Accelerators (GPUs, etc.)
  When present, work is offloaded to accelerators on the node. How you share GPUs among ranks/threads is part of node-level resource planning, but detailed accelerator usage is covered elsewhere.
Effective node-level design generally aims to:
- Use all physical cores efficiently
- Preserve cache and NUMA locality
- Avoid oversubscription and resource contention
MPI Ranks vs Threads on a Node
In a hybrid program you must decide:
- How many MPI ranks per node (R)
- How many threads per rank (T)
subject to:
$$
R \times T \approx \text{number of usable hardware threads per node}
$$
Common patterns:
- One rank per core (no threading): R = #cores, T = 1
  Pure MPI; simple but can stress the node’s memory system and MPI stack.
- One rank per socket, threaded across the socket: R = #sockets, T = #cores per socket
  Good for NUMA locality and for reducing MPI communication endpoints.
- One rank per NUMA domain
  On systems where a socket has multiple NUMA domains; each rank handles its local domain with threads.
- Few ranks per node, many threads each
  Useful for codes with strong shared-memory components and large working sets.
The “best” configuration is highly application- and architecture-dependent and often requires benchmarking.
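As a starting point for such benchmarking, the patterns above might be launched like this under SLURM. This is a sketch: the node size (2 sockets × 16 cores) and the binary name `./my_hybrid_app` are placeholders, and exact option behavior varies between sites.

```shell
# Hypothetical dual-socket node: 2 sockets x 16 cores = 32 physical cores.

# Pure MPI: one rank per core, no threading (R = 32, T = 1)
srun --ntasks-per-node=32 --cpus-per-task=1 ./my_hybrid_app

# One rank per socket, threaded across it (R = 2, T = 16)
OMP_NUM_THREADS=16 srun --ntasks-per-node=2 --cpus-per-task=16 ./my_hybrid_app

# Few ranks, many threads (R = 1, T = 32)
OMP_NUM_THREADS=32 srun --ntasks-per-node=1 --cpus-per-task=32 ./my_hybrid_app
```

Running the same binary under each mapping and comparing timings is usually the quickest way to find a good configuration for a given machine.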
Work Decomposition Inside a Node
Once the MPI layer distributes work across nodes, node-level parallelism decides how work inside each rank is divided among threads.
Typical patterns:
- Loop parallelism
Use OpenMP to parallelize computational loops:
```c
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    // work on local portion of data
}
```
Work is evenly or dynamically split among threads on that node.
- Task-based parallelism
Use tasks for irregular or hierarchical work:
```c
#pragma omp parallel
#pragma omp single
{
    for (int t = 0; t < NTASKS; ++t) {
        #pragma omp task
        do_task(t);
    }
}
```
Threads pull tasks from a shared queue, balancing load within the node.
- Domain tiling / blocking
  Within each MPI subdomain, subdivide work into tiles or blocks given to threads. This improves cache reuse while exploiting all cores.
- Pipeline within a node
  Different threads handle different stages of a pipeline (I/O, compute, post-processing) but stay within one process on the node.
Design questions at node level:
- Do threads share a large array, or does each thread operate on its own subrange?
- Is the computation regular (good for a simple `parallel for`) or irregular (better for tasks)?
- How does the decomposition interact with caches (tile size, data layout)?
Core Binding and Affinity
A central issue in node-level parallelism is where your threads and ranks run physically.
Reasons to control affinity:
- Reduced context switching and cache thrashing
- Better NUMA locality (data closer to the thread that uses it)
- More reproducible performance
Common tools and concepts:
- Scheduler options:
  In SLURM, for example:
  - `--ntasks-per-node` (MPI ranks per node)
  - `--cpus-per-task` (threads per MPI rank)
  - `--ntasks-per-socket` or `--distribution=block,cyclic`
  - `--hint=nomultithread` to avoid using SMT/Hyper‑Threads
- MPI process binding:
  MPI implementations often have options like `--bind-to core`, `--bind-to socket`, `--map-by socket`, or environment variables controlling rank placement.
- OpenMP thread binding:
  Environment variables such as:
  - `OMP_PROC_BIND=true` (or `close`, `spread`)
  - `OMP_PLACES=cores` (or `threads`, `sockets`)
Example: one MPI rank per socket, 8 cores per socket:
```shell
srun --ntasks-per-node=2 --cpus-per-task=8 \
     --hint=nomultithread \
     ./my_hybrid_app
```
with the following set in the job environment:
```shell
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores
```
This tries to keep each rank’s threads tightly packed on its local cores.
NUMA-Aware Node-Level Parallelism
On NUMA nodes, memory access time depends on which socket’s memory is accessed by which core. Poor node-level design can double effective memory latency.
Key ideas:
- First-touch allocation
Many OSes place physical memory pages on the NUMA node of the core that first writes to them. Therefore, initialization loops should be parallelized in the same way as the later computation:
```c
#pragma omp parallel for
for (long i = 0; i < N; ++i) {
    a[i] = 0.0; // first-touch initialization
}
```
so each thread “owns” and touches the data it will later use.
- Rank-to-NUMA mapping
Place one MPI rank per NUMA domain, with its threads bound inside that domain. Each rank mostly accesses its own local memory. - Avoid frequent cross-socket accesses
Shared global data structures that are heavily updated from all sockets can cause remote accesses and contention. Prefer per-socket (or per-thread) data and reduce/merge infrequently.
Oversubscription and Hardware Threads
Node-level parallelism should match hardware capabilities:
- Physical cores vs hardware threads
  If a node has 32 physical cores with 2 hardware threads each, you might see 64 “logical CPUs.” Many workloads perform best using only the 32 physical cores.
- Oversubscription
  Running more runnable threads or ranks than hardware threads, e.g. `R * T > #logical CPUs`:
  - Usually degrades performance due to context switching and cache thrashing.
  - Occasionally used if some threads block on I/O, but rarely beneficial in pure compute HPC codes.
Recommended practice for compute-intensive hybrid codes:
- Ensure:
$$
R \times T \le \text{number of logical CPUs}
$$
- Often:
$$
R \times T = \text{number of physical cores}
$$
gives the best performance.
Balancing Work Within a Node
Even if load is balanced across MPI ranks, it can be imbalanced across threads within a node.
Node-level strategies:
- OpenMP scheduling
  `schedule(static)` for regular workloads; `schedule(dynamic, chunk)` or `guided` for irregular workloads.
- Per-thread work queues or task pools
  For more complex algorithms, using tasks or manual work stealing within the node.
- Granularity tuning
  Too fine-grained parallelism inside the node can incur synchronization overhead; too coarse can lead to idle threads.
Example:
```c
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < Nblocks; ++i) {
    process_block(i);
}
```
This allows faster threads to pick up extra blocks, balancing workload within the node.
Node-Level Parallel I/O Considerations
When multiple ranks and threads on a node perform I/O:
- Avoid all threads writing small files; this can overwhelm the filesystem.
- Common hybrid approaches:
- A single I/O thread per rank handling writes for that rank’s threads.
- A single I/O rank per node collecting data from intra-node peers and performing larger, batched writes.
- Use buffering and aggregation at node level to reduce pressure on the parallel filesystem.
Typical Node-Level Configurations in Practice
Some representative setups:
- Memory-bandwidth bound stencil code on a dual-socket node with 2×16 cores:
- 2 ranks per node (1 per socket)
- 16 OpenMP threads per rank
- Bind ranks and threads to their socket
- NUMA-aware data layout and first-touch
- Communication-heavy MPI application with moderate compute per rank:
- Many smaller MPI ranks per node (e.g. 8 or 16), each with 2–4 threads
- Keep per-rank memory footprint small
- Use node-level threading mainly for inner loops, not for all work
- Irregular workload with tasks:
- Few MPI ranks per node (e.g. 1–2)
- Many OpenMP threads using task-based parallelism
- Node-level dynamic load balancing handled by the runtime
Practical Tips for Node-Level Tuning
- Start with a simple mapping (e.g. 1 rank per socket, one thread per physical core), then measure.
- Use profiling tools that show per-core and per-socket utilization to check:
- Are some cores idle?
- Is one socket saturated while the other is underused?
- Check memory bandwidth and NUMA locality:
  - Tools (e.g. `numactl`, platform-specific profilers) can show remote vs local accesses.
- Experiment with:
  - Different `R × T` combinations
  - OpenMP scheduling policies
  - Core and NUMA binding strategies
Node-level parallelism is where much of the performance tuning of hybrid codes happens: the hardware is fixed, but how you map ranks, threads, and data to that hardware can change performance by large factors.