Node-level parallelism

Scope of Node-level Parallelism

Node-level parallelism refers to how you use all the computational resources inside a single cluster node. A typical modern node contains several CPU cores, often multiple sockets, large main memory, and sometimes one or more accelerators such as GPUs. In a hybrid HPC program, node-level parallelism usually concerns how you combine shared-memory techniques, especially OpenMP threads, with the MPI processes used at cluster level.

At this level you care about threads, cores, caches, NUMA domains, memory bandwidth, and the mapping of work to these resources. The aim is to keep every core on the node busy with useful work, while avoiding contention and unnecessary data movement.

Node-level parallelism connects what you learned about shared-memory programming to the broader hybrid model, where multiple nodes cooperate through MPI and each node exploits its own internal parallelism using threads.

Hardware Structure Inside a Node

A node is a small parallel computer. It typically has one or more CPU sockets, and each socket has several cores that share some levels of cache and a path to main memory. The caches usually form a hierarchy, for example L1 and L2 caches private to each core and an L3 cache shared among the cores of a socket.

Many nodes use a NUMA layout. NUMA stands for Non-Uniform Memory Access. In a NUMA node, each socket has local memory that it can access quickly, and it can also access memory attached to other sockets more slowly. From the programmer’s perspective, all this memory forms one address space for threads on the node, but access times are not uniform.

Within a node, you may also have GPUs or other accelerators that interact with the CPU via PCIe or a similar interconnect. GPU use is covered elsewhere. In the context of node-level parallelism, GPUs matter mainly because they change how many CPU cores you want to devote to CPU computation versus feeding the accelerators and handling I/O and communication.

The internal structure of a node strongly influences how you should place MPI processes and OpenMP threads, how you allocate memory, and which parallelism model you use inside the node.

Roles of MPI Processes and Threads on a Node

In hybrid programming, node-level parallelism is often realized as several MPI processes per node, each of which spawns multiple OpenMP threads. This creates a two-level hierarchy. MPI ranks handle interprocess communication across nodes and sometimes across NUMA domains on a node. OpenMP threads then parallelize loops and regions inside each process.
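
To make this hierarchy concrete, the sketch below shows a minimal hybrid C code: MPI provides the outer level of ranks, and OpenMP provides the inner level of threads inside each rank. The array size and the work inside the loop are placeholders chosen only for illustration.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000                    /* placeholder local problem size */
    static double local[N];

    int main(int argc, char **argv) {
        int provided, rank;

        /* Request FUNNELED support: only the master thread of each rank
         * makes MPI calls in this sketch. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Inner level: OpenMP threads parallelize the loop inside this rank. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            local[i] = rank + 0.001 * i; /* placeholder computation */

        double sum = 0.0, global = 0.0;
        for (int i = 0; i < N; i++)
            sum += local[i];

        /* Outer level: MPI combines results across ranks and nodes. */
        MPI_Reduce(&sum, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }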

There are different layouts for processes and threads on a node. One common pattern is one MPI process per NUMA domain and many threads per process. Another is one MPI process per core and no threads. The pure MPI case is still a form of node-level parallelism but it does not use shared-memory threading.

Although MPI and OpenMP both allow parallel work within the node, they differ in memory model and overheads. MPI processes have separate memory spaces, which can improve locality but makes data exchange explicit and sometimes more expensive. OpenMP threads share memory, which makes data sharing easy but requires careful synchronization.

Node-level design must decide how many MPI ranks to run on the node, how many threads per rank, and where to place each rank and its threads relative to cores and NUMA domains.

Mapping Parallelism to Cores and Sockets

Node-level performance depends strongly on how you map software entities to hardware cores. The two central decisions are process placement and thread placement. This is often called affinity or pinning.

At the process level, you control which cores and sockets each MPI rank can use. A rank that is responsible for data stored in a particular NUMA domain ideally runs on a core in that domain. This reduces remote memory accesses.

At the thread level, you choose whether and how to pin threads to cores. OpenMP implementations usually support environment variables that define how threads are mapped. Without a careful mapping, threads may migrate between cores, which can cause poor cache reuse and extra overhead.
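
As a concrete illustration, OpenMP reads the OMP_PLACES and OMP_PROC_BIND environment variables to control this mapping, and the place query functions introduced in OpenMP 4.5 let a program report where its threads actually run. The sketch below is only a diagnostic; the exact places you see depend on the runtime and the node.

    #include <omp.h>
    #include <stdio.h>

    /* Example invocation (assumed shell settings, shown here as a comment):
     *   OMP_PLACES=cores OMP_PROC_BIND=close ./report_binding
     * binds one thread per physical core, filling neighbouring cores first. */
    int main(void) {
        #pragma omp parallel
        {
            int tid    = omp_get_thread_num();
            int place  = omp_get_place_num();    /* place this thread executes in */
            int places = omp_get_num_places();   /* total places in the binding */
            printf("thread %d runs in place %d of %d\n", tid, place, places);
        }
        return 0;
    }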

Node-level mapping should respect cache sharing patterns. For example, if two threads frequently access shared data, placing them on sibling cores that share a cache may improve performance. On the other hand, if they independently work on data sets that almost fill a cache, placing them on cores that compete for the same cache can be harmful.

The mapping also interacts with Simultaneous Multithreading, also called SMT or hyperthreading. If the node has SMT enabled, each physical core presents multiple logical cores. You must decide whether to use all logical cores, which can help throughput but may reduce performance for memory-bound codes, or to use only one thread per physical core.

NUMA-aware Parallelism and Memory Placement

NUMA behavior is one of the most important node-level effects. Access to local memory is faster than access to remote memory. On many systems the latency and bandwidth penalty for remote access is significant. If threads constantly read and write data in a remote NUMA domain, the node-level parallel efficiency will drop.

NUMA-aware programming attempts to align threads, processes, and data allocations. A common rule is called first touch. Operating systems often allocate physical memory to a page the first time a thread writes to it. If you initialize a large array in parallel, with each thread touching the part of the array it will later use, the operating system can place those pages in the NUMA domain local to the thread.

To apply first touch effectively you must ensure that the initialization phase runs with the same mapping of threads to cores that will be used later. A serial initialization from a single thread typically places all data in the memory local to that thread’s core or socket. Other threads then access this data remotely.
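
The sketch below shows the first-touch pattern in C with OpenMP, with a placeholder array size. The essential point is that the initialization loop uses the same static schedule, and therefore the same thread-to-data mapping, as the later compute loop.

    #include <omp.h>
    #include <stdlib.h>

    #define N 100000000L   /* placeholder size, large enough to span many pages */

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));

        /* First touch: each thread initializes the part it will later use,
         * so the OS places those pages in the thread's local NUMA domain. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) {
            a[i] = 0.0;
            b[i] = 1.0;
        }

        /* Compute phase: the same static schedule keeps each thread on the
         * pages it touched first, so most accesses stay NUMA-local. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] += 2.0 * b[i];

        free(a);
        free(b);
        return 0;
    }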

In hybrid programs, you can combine MPI and OpenMP to express NUMA-aware ownership of data. For example, each MPI rank can own a subdomain of the problem that is stored in memory local to its socket, and within each rank OpenMP threads work on chunks of that subdomain. This reduces cross-socket memory traffic.

Some systems provide explicit NUMA control through libraries and tools that let you bind memory allocations to a particular NUMA node. These are advanced options that build on the same fundamental idea: keep data and the threads that use it together on the same part of the node.
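
On Linux, one such library is libnuma. The hedged sketch below places an allocation on a chosen NUMA node with numa_alloc_onnode; node 0 is an arbitrary example, the size is a placeholder, and the program must be linked against libnuma (-lnuma).

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        size_t bytes = 100000000 * sizeof(double);  /* placeholder size */

        /* Allocate memory backed by pages on NUMA node 0 (example choice);
         * threads running on that node's cores then access it locally. */
        double *a = numa_alloc_onnode(bytes, 0);
        if (a == NULL) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        /* ... use the array with threads pinned to cores of node 0 ... */

        numa_free(a, bytes);
        return 0;
    }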

Balancing Work and Resources Inside the Node

Node-level parallelism must balance the work across all cores on the node. This is related to general load balancing, but with some node-specific considerations. The total amount of work per node is fixed by the partitioning across MPI ranks, and within a node you must distribute that work fairly among threads.

OpenMP and similar models provide static and dynamic scheduling policies for loop iterations. Static scheduling assigns fixed chunks of iterations to each thread. This has low overhead and often works well when each iteration has similar cost. Dynamic scheduling distributes iterations to threads at runtime and can handle irregular workloads. The cost is higher runtime overhead and potentially worse data locality.

Within a node, you can often combine coarse-grained partitioning among MPI ranks with fine-grained scheduling among threads. For example, each rank receives a block of consecutive rows of a grid, and OpenMP uses static scheduling across threads to divide rows within that block. In more irregular applications, dynamic or guided scheduling can smooth out imbalances while still exploiting shared memory.
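
The sketch below contrasts static and dynamic scheduling on a row loop with deliberately uneven per-row cost. The row count, chunk size, and artificial work function are placeholders; which policy wins depends on how irregular the real workload is.

    #include <math.h>
    #include <omp.h>
    #include <stdio.h>

    #define NROWS 10000              /* placeholder number of rows in this block */
    static double result[NROWS];

    /* Placeholder per-row work with deliberately uneven cost. */
    static void work_on_row(int row) {
        double s = 0.0;
        for (int i = 0; i < (row % 100) * 200; i++)
            s += sin((double)i);
        result[row] = s;
    }

    int main(void) {
        /* Static scheduling: fixed, contiguous chunks per thread.
         * Low overhead and good locality when rows cost about the same. */
        double t0 = omp_get_wtime();
        #pragma omp parallel for schedule(static)
        for (int row = 0; row < NROWS; row++)
            work_on_row(row);
        printf("static:  %.3f s\n", omp_get_wtime() - t0);

        /* Dynamic scheduling: threads grab chunks of 64 rows at runtime.
         * Smooths out irregular row costs at the price of scheduling
         * overhead and possibly worse data locality. */
        t0 = omp_get_wtime();
        #pragma omp parallel for schedule(dynamic, 64)
        for (int row = 0; row < NROWS; row++)
            work_on_row(row);
        printf("dynamic: %.3f s\n", omp_get_wtime() - t0);
        return 0;
    }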

Balancing also involves competition for shared resources. Threads share the memory bus and caches. If you create more threads than physical cores, or if each thread uses a very large working set, they will compete heavily for memory bandwidth and cache space. In that case, fewer threads per node can be faster than trying to use every hardware context.

Node-level tuning often involves experiments that vary the number of threads per process and processes per node, combined with different scheduling policies, to find the configuration that best matches the workload and the characteristics of the node.

Interaction of Node-level Parallelism with Memory Bandwidth

On many nodes, compute capability per core has grown faster than memory bandwidth. As a result, many HPC codes are memory bound at node level. This means that adding more threads does not always increase throughput proportionally, because all threads are waiting for data.

Within a node, each socket has a limited number of memory channels. When enough cores issue loads and stores, these channels saturate. At that point additional cores add very little performance and can increase contention. Node-level parallelism must consider this saturation point.

Thread parallelism on memory-bound loops can still be useful up to the point where the bandwidth is saturated. A rough way to explore this is to run the application with increasing thread counts while keeping the number of MPI processes per node fixed. When performance stops improving, or even worsens, you may have reached the bandwidth limit.
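
One way to probe this is a simple bandwidth-bound loop, as in the sketch below: run it with increasing OMP_NUM_THREADS and watch where the reported bandwidth stops growing. The triad kernel follows the common STREAM-style pattern, and the array size is a placeholder chosen to exceed the caches.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 80000000L   /* placeholder: large enough to exceed all caches */

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));

        /* First-touch initialization with the same schedule as the kernel. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        /* STREAM-like triad: three arrays streamed, almost no compute,
         * so throughput is limited by memory bandwidth. */
        double t0 = omp_get_wtime();
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];
        double t = omp_get_wtime() - t0;

        double gb = 3.0 * N * sizeof(double) / 1e9;   /* bytes moved, roughly */
        printf("%d threads: %.2f GB/s\n", omp_get_max_threads(), gb / t);

        free(a); free(b); free(c);
        return 0;
    }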

Certain optimizations at node level can reduce memory pressure. These include blocking or tiling loops to increase cache reuse, reordering data structures to improve spatial locality, and using vectorization. These topics are developed elsewhere. The important point for node-level parallelism is that thread-level scaling is limited by available data movement capacity as much as by compute capacity.

For hybrid codes, you may sometimes improve node-level performance by increasing the number of MPI processes per node while reducing threads per process. Individual MPI processes then work on smaller data sets that may fit better in cache, at the cost of more intra-node MPI communication. Modern MPI libraries can use shared memory optimizations for processes on the same node to reduce that cost.

Typical Hybrid Configurations on a Node

There are several common ways to exploit node-level parallelism in a hybrid MPI plus OpenMP code. Each configuration represents a different tradeoff between simplicity, NUMA friendliness, and communication overhead.

One extreme is pure MPI. You run one MPI process per core and no threads. All parallelism, including node-level parallelism, is expressed through processes. This configuration keeps memory spaces separate and can be simpler to reason about, but increases the number of MPI ranks and messages.

Another configuration is one MPI process per node and many threads inside this process. This uses OpenMP for essentially all node-level parallel work, including work that spans sockets. Data is shared in one address space. However, managing NUMA performance can become more complex, because threads on different sockets may access the same data.

A compromise that often works well is one MPI process per socket and several threads per process. Here each process owns data allocated in its socket’s local memory, and threads in that process work inside that NUMA domain. Communication across sockets and across nodes is handled by MPI. This configuration often balances memory locality and message count.

For codes that are strongly bandwidth bound, it is also common to undersubscribe the node. This means you deliberately use fewer active threads or processes than available cores. For example, if the node has 64 cores, you might run 32 threads while still using all memory channels. The unused cores can reduce competition for shared resources.

Controlling Threading Inside MPI Ranks

To exploit node-level parallelism cleanly, you need to control threading behavior of each MPI rank. This is especially important on shared clusters where environment settings such as OMP_NUM_THREADS and affinity variables may be preconfigured.

Each rank typically reads environment variables at startup to determine the number of threads it will use and how they are bound to cores. When you launch a hybrid job, you arrange the job scheduler options and these variables so that the total number of threads across all ranks does not exceed the number of available cores on the node unless you explicitly want oversubscription.
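
A small check like the sketch below, run once at startup, makes the actual configuration visible: each rank reports its host name and the number of OpenMP threads it will use, so you can confirm that ranks per node times threads per rank matches the core count you intended.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        /* omp_get_max_threads() reflects OMP_NUM_THREADS as seen by this rank. */
        printf("rank %d on %s will use %d OpenMP threads\n",
               rank, host, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }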

You must also be careful that MPI itself does not spawn extra threads unexpectedly. Some MPI implementations can internally use threads for communication progress or for I/O. If you request higher levels of MPI thread support, these internal threads can appear. For pure node-level parallelism it is often sufficient to use an MPI thread level that allows the combination you need without introducing unnecessary concurrency.
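
A hedged sketch of requesting and checking such a level is shown below, assuming only the master thread of each rank calls MPI so that MPI_THREAD_FUNNELED is enough.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided;
        /* MPI_THREAD_FUNNELED suffices when only the master thread calls MPI;
         * requesting MPI_THREAD_MULTIPLE here could enable extra internal
         * concurrency in some MPI libraries that this code does not need. */
        int required = MPI_THREAD_FUNNELED;

        MPI_Init_thread(&argc, &argv, required, &provided);
        if (provided < required) {
            fprintf(stderr, "MPI thread support too low: got %d, need %d\n",
                    provided, required);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... hybrid work ... */
        MPI_Finalize();
        return 0;
    }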

Within the application, you place parallel regions where they give the best payoff at node level. It is often better to have a smaller number of large parallel regions instead of many short regions, so that threads remain active and do not pay repeated startup and shutdown costs.
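
A minimal sketch of that restructuring, with placeholder loops: one parallel region encloses several worksharing loops, so the thread team is created once instead of around every loop.

    #include <omp.h>

    #define N 1000000   /* placeholder size */
    static double a[N], b[N], c[N];

    void step(void) {
        /* One parallel region: the thread team is created once and reused
         * for several worksharing loops, instead of paying the fork/join
         * cost of a separate "parallel for" around each loop. */
        #pragma omp parallel
        {
            #pragma omp for schedule(static)
            for (int i = 0; i < N; i++)
                b[i] = 2.0 * a[i];

            /* The implicit barrier at the end of the loop above ensures
             * b is complete before the next loop reads it. */
            #pragma omp for schedule(static)
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }
    }

    int main(void) { step(); return 0; }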

Measuring and Tuning Node-level Performance

Efficient node-level parallelism is not guaranteed by theory alone. You need measurements to see how well your program uses the cores and memory within the node. Basic runtime measurements across runs with different thread and process counts can reveal scaling behavior. More detailed tools can show whether you are limited by CPU, memory bandwidth, or synchronization.

At node level, key questions include how much time is spent in serial regions compared to parallel regions, how evenly the work is divided among threads, and whether any cores are idle while others are busy. Profilers and hardware performance counters can give you information about cache misses, memory bandwidth, and instruction mix.

An iterative tuning process often works best. You start from a simple mapping such as one rank per socket and a fixed thread count. You then vary the number of threads per rank, adjust scheduling policies for loops, and experiment with different affinity and NUMA settings. Each change is tested to see whether node-level performance improves.
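
A lightweight way to drive such experiments from inside the code is sketched below: the same kernel is timed at several thread counts set with omp_set_num_threads, which is enough to see where node-level scaling flattens. The kernel, sizes, and thread counts are placeholders; in production runs you would more often vary OMP_NUM_THREADS and the ranks per node through the job script.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 50000000L   /* placeholder size */

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        int counts[] = {1, 2, 4, 8, 16, 32};   /* example thread counts */

        for (int k = 0; k < 6; k++) {
            omp_set_num_threads(counts[k]);

            /* Re-initialize with the current team so first touch matches. */
            #pragma omp parallel for schedule(static)
            for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

            double t0 = omp_get_wtime();
            #pragma omp parallel for schedule(static)
            for (long i = 0; i < N; i++)
                a[i] += 0.5 * b[i];
            printf("%2d threads: %.4f s\n", counts[k], omp_get_wtime() - t0);
        }

        free(a);
        free(b);
        return 0;
    }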

Node-level parallel efficiency is strongly influenced by NUMA locality, memory bandwidth limits, and correct mapping of MPI ranks and threads to cores. Always measure performance while varying the number of threads per process and processes per node to identify a configuration that uses the node’s resources effectively.

The choices you make at node level will interact with cluster-level design, but understanding and tuning this level separately is a crucial step in building efficient hybrid HPC applications.