Understanding Node-Level Parallelism in Hybrid HPC Codes
Node-level parallelism refers to how you exploit all the resources within a single compute node (CPU sockets, cores, hardware threads, and sometimes accelerators) before or alongside scaling out across multiple nodes with MPI. In a hybrid MPI+OpenMP (or MPI+threads) program, this is where OpenMP (or another threading model) typically operates.
This chapter focuses on:
- The hardware resources inside a node that you can parallelize over
- How MPI ranks and threads share the node
- Typical strategies for organizing work inside a node
- Practical considerations: core pinning, memory locality, oversubscription
Higher-level “why hybrid?” and “how MPI+OpenMP work together” questions are covered in other chapters; here the emphasis is on using a single node effectively.
Hardware Resources Inside a Node
A typical HPC node contains:
- One or more sockets (packages), each with a CPU
- Each CPU containing cores
- Each core supporting one or more hardware threads (e.g. Intel Hyper‑Threading)
- A hierarchy of caches and memory controllers
- Often NUMA (Non‑Uniform Memory Access) regions, where each socket has local memory
From a node-level parallelism viewpoint, the main levers you control are:
- How many MPI processes you launch per node
- How many threads each process uses
- How those processes and threads are bound to cores and NUMA regions
Forms of Node-Level Parallelism
Within a node you can combine several forms of parallelism:
- Multi-process (MPI ranks on the same node)
  Multiple ranks on one node, communicating via shared memory or MPI’s shared-memory transport.
- Multi-threading (e.g. OpenMP, pthreads, TBB)
  Multiple threads within a single process sharing memory and data structures.
- SIMD / vectorization
  Compilers use vector instructions within each core. This is orthogonal to threads and MPI and is usually considered “in-core” parallelism.
- Accelerators (GPUs, etc.)
  When present, work is offloaded to accelerators on the node. How you share GPUs among ranks/threads is part of node-level resource planning, but detailed accelerator usage is covered elsewhere.
Effective node-level design generally aims to:
- Use all physical cores efficiently
- Preserve cache and NUMA locality
- Avoid oversubscription and resource contention
MPI Ranks vs Threads on a Node
In a hybrid program you must decide:
- How many MPI ranks per node (R)
- How many threads per rank (T)
subject to:
$$
R \times T \approx \text{number of usable hardware threads per node}
$$
Common patterns:
- One rank per core (no threading): R = #cores, T = 1
  Pure MPI; simple but can stress the node’s memory system and MPI stack.
- One rank per socket, threaded across the socket: R = #sockets, T = #cores per socket
  Good for NUMA locality and for reducing MPI communication endpoints.
- One rank per NUMA domain
  On systems where a socket has multiple NUMA domains; each rank handles its local domain with threads.
- Few ranks per node, many threads each
  Useful for codes with strong shared-memory components and large working sets.
The “best” configuration is highly application- and architecture-dependent and often requires benchmarking.
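As a starting point for such benchmarking, the patterns above might be launched like this under SLURM. This is a sketch: the node size (2 sockets × 16 cores) and the binary name `./my_hybrid_app` are placeholders, and exact option behavior varies between sites.

```shell
# Hypothetical dual-socket node: 2 sockets x 16 cores = 32 physical cores.

# Pure MPI: one rank per core, no threading (R = 32, T = 1)
srun --ntasks-per-node=32 --cpus-per-task=1 ./my_hybrid_app

# One rank per socket, threaded across it (R = 2, T = 16)
OMP_NUM_THREADS=16 srun --ntasks-per-node=2 --cpus-per-task=16 ./my_hybrid_app

# Few ranks, many threads (R = 1, T = 32)
OMP_NUM_THREADS=32 srun --ntasks-per-node=1 --cpus-per-task=32 ./my_hybrid_app
```

Running the same binary under each mapping and comparing timings is usually the quickest way to find a good configuration for a given machine.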
Work Decomposition Inside a Node
Once the MPI layer distributes work across nodes, node-level parallelism decides how work inside each rank is divided among threads.
Typical patterns:
- Loop parallelism
Use OpenMP to parallelize computational loops:
```c
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    // work on local portion of data
}
```
Work is evenly or dynamically split among threads on that node.
- Task-based parallelism
Use tasks for irregular or hierarchical work:
```c
#pragma omp parallel
#pragma omp single
{
    for (int t = 0; t < NTASKS; ++t) {
        #pragma omp task
        do_task(t);
    }
}
```
Threads pull tasks from a shared queue, balancing load within the node.
- Domain tiling / blocking
  Within each MPI subdomain, subdivide work into tiles or blocks given to threads. This improves cache reuse while exploiting all cores.
- Pipeline within a node
  Different threads handle different stages of a pipeline (I/O, compute, post-processing) but stay within one process on the node.
Design questions at node level:
- Do threads share a large array, or does each thread operate on its own subrange?
- Is the computation regular (good for a simple `parallel for`) or irregular (better for tasks)?
- How does the decomposition interact with caches (tile size, data layout)?
Core Binding and Affinity
A central issue in node-level parallelism is where your threads and ranks run physically.
Reasons to control affinity:
- Reduced context switching and cache thrashing
- Better NUMA locality (data closer to the thread that uses it)
- More reproducible performance
Common tools and concepts:
- Scheduler options:
  In SLURM, for example:
  - `--ntasks-per-node` (MPI ranks per node)
  - `--cpus-per-task` (threads per MPI rank)
  - `--ntasks-per-socket` or `--distribution=block,cyclic`
  - `--hint=nomultithread` to avoid using SMT/Hyper‑Threads
- MPI process binding:
  MPI implementations often have options like `--bind-to core`, `--bind-to socket`, `--map-by socket`, or environment variables controlling rank placement.
- OpenMP thread binding:
  Environment variables such as:
  - `OMP_PROC_BIND=true` (or `close`, `spread`)
  - `OMP_PLACES=cores` (or `threads`, `sockets`)
Example: one MPI rank per socket, 8 cores per socket:
```shell
srun --ntasks-per-node=2 --cpus-per-task=8 \
     --hint=nomultithread \
     ./my_hybrid_app
```
with the following set in the job environment:
```shell
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores
```
This tries to keep each rank’s threads tightly packed on its local cores.
NUMA-Aware Node-Level Parallelism
On NUMA nodes, memory access time depends on which socket’s memory is accessed by which core. Poor node-level design can double effective memory latency.
Key ideas:
- First-touch allocation
Many OSes place physical memory pages on the NUMA node of the core that first writes to them. Therefore, initialization loops should be parallelized in the same way as the later computation:
```c
#pragma omp parallel for
for (long i = 0; i < N; ++i) {
    a[i] = 0.0; // first-touch initialization
}
```
so each thread “owns” and touches the data it will later use.
- Rank-to-NUMA mapping
Place one MPI rank per NUMA domain, with its threads bound inside that domain. Each rank mostly accesses its own local memory. - Avoid frequent cross-socket accesses
Shared global data structures that are heavily updated from all sockets can cause remote accesses and contention. Prefer per-socket (or per-thread) data and reduce/merge infrequently.
Oversubscription and Hardware Threads
Node-level parallelism should match hardware capabilities:
- Physical cores vs hardware threads
  If a node has 32 physical cores with 2 hardware threads each, you might see 64 “logical CPUs.” Many workloads perform best using only the 32 physical cores.
- Oversubscription
  Running more runnable threads or ranks than hardware threads, e.g. `R * T > #logical CPUs`:
  - Usually degrades performance due to context switching and cache thrashing.
  - Occasionally used if some threads block on I/O, but rarely beneficial in pure compute HPC codes.
Recommended practice for compute-intensive hybrid codes:
- Ensure:
$$
R \times T \le \text{number of logical CPUs}
$$
- Often:
$$
R \times T = \text{number of physical cores}
$$
gives the best performance.
Balancing Work Within a Node
Even if load is balanced across MPI ranks, it can be imbalanced across threads within a node.
Node-level strategies:
- OpenMP scheduling
  `schedule(static)` for regular workloads; `schedule(dynamic, chunk)` or `guided` for irregular workloads.
- Per-thread work queues or task pools
  For more complex algorithms, using tasks or manual work stealing within the node.
- Granularity tuning
  Too fine-grained parallelism inside the node can incur synchronization overhead; too coarse can lead to idle threads.
Example:
```c
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < Nblocks; ++i) {
    process_block(i);
}
```
This allows faster threads to pick up extra blocks, balancing workload within the node.
Node-Level Parallel I/O Considerations
When multiple ranks and threads on a node perform I/O:
- Avoid all threads writing small files; this can overwhelm the filesystem.
- Common hybrid approaches:
- A single I/O thread per rank handling writes for that rank’s threads.
- A single I/O rank per node collecting data from intra-node peers and performing larger, batched writes.
- Use buffering and aggregation at node level to reduce pressure on the parallel filesystem.
Typical Node-Level Configurations in Practice
Some representative setups:
- Memory-bandwidth bound stencil code on a dual-socket node with 2×16 cores:
- 2 ranks per node (1 per socket)
- 16 OpenMP threads per rank
- Bind ranks and threads to their socket
- NUMA-aware data layout and first-touch
- Communication-heavy MPI application with moderate compute per rank:
- Many smaller MPI ranks per node (e.g. 8 or 16), each with 2–4 threads
- Keep per-rank memory footprint small
- Use node-level threading mainly for inner loops, not for all work
- Irregular workload with tasks:
- Few MPI ranks per node (e.g. 1–2)
- Many OpenMP threads using task-based parallelism
- Node-level dynamic load balancing handled by the runtime
Practical Tips for Node-Level Tuning
- Start with a simple mapping (e.g. 1 rank per socket, one thread per physical core), then measure.
- Use profiling tools that show per-core and per-socket utilization to check:
- Are some cores idle?
- Is one socket saturated while the other is underused?
- Check memory bandwidth and NUMA locality:
  - Tools (e.g. `numactl`, platform-specific profilers) can show remote vs local accesses.
- Experiment with:
  - Different `R × T` combinations
  - OpenMP scheduling policies
  - Core and NUMA binding strategies
Node-level parallelism is where much of the performance tuning of hybrid codes happens: the hardware is fixed, but how you map ranks, threads, and data to that hardware can change performance by large factors.