
Hybrid Parallel Programming

Motivation and Context

Hybrid parallel programming refers to the deliberate combination of different parallel programming models in a single application. In the context of this course, it primarily means using MPI for distributed memory parallelism across nodes, and OpenMP or similar thread models for shared memory parallelism within a node.

Modern HPC systems are built from nodes that each have many cores and often multiple sockets, and clusters can combine thousands of such nodes. A single node is typically a shared memory machine, where threads can access the same address space, while the cluster as a whole behaves as a distributed memory system, where data must be communicated explicitly between nodes. Hybrid programming arises naturally from this hardware structure.

The essential idea is to use one level of parallelism to exploit hardware parallelism between nodes, and another level to exploit hardware parallelism within nodes. For many applications, this leads to better scalability, improved resource usage, and sometimes a simpler mapping from the algorithm to the machine.

Two-Level Parallelism: Nodes and Cores

A hybrid program usually runs one or more MPI processes per node, and within each MPI process multiple threads operate in shared memory. On a cluster with $N$ nodes and $C$ cores per node, you might choose to run $P$ MPI processes per node and $T$ threads per process, where $P \times T \le C$.

The simple relation
$$
\text{total hardware threads} = N \times C
$$
is split logically into
$$
\text{total MPI processes} = N \times P
$$
and
$$
\text{threads per process} = T
$$
so that the degree of parallelism in the application is
$$
\text{total concurrent execution units} = (N \times P) \times T.
$$

This separation lets you assign responsibilities. MPI processes typically handle the distributed aspects of the computation, such as managing separate portions of the global problem and performing communications between nodes. Threads typically refine the work within each MPI process, often by parallelizing loops, local computations, and node-local data structures.

A central design choice in a hybrid program is how to map the mathematical or algorithmic decomposition of the problem onto this two-level parallel structure. For example, in a domain decomposition method, you might assign large spatial subdomains to MPI processes and then divide each subdomain further among threads.
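
The following minimal sketch illustrates this two-level structure: each MPI rank hosts a team of OpenMP threads, and every thread reports its position in the hierarchy. It assumes an MPI library and OpenMP compiler support are available; build commands and flags vary between systems.

```c
/* Minimal sketch of the two-level structure: each MPI rank hosts a team of
 * OpenMP threads. Typically built with an MPI compiler wrapper plus the
 * compiler's OpenMP flag (details vary between systems). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;

    /* Request FUNNELED support: only the master thread will call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        /* Each thread identifies its position in the two-level hierarchy. */
        printf("rank %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```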

Advantages of Hybrid Programming

Hybrid programming is not automatically superior to pure MPI or pure shared memory programming. It offers specific advantages that become important on large systems or for complex algorithms.

One advantage is memory efficiency. Pure MPI codes that create one process per core may duplicate metadata or small data structures in every process. Hybrid codes can store such data once per MPI process and let multiple threads share it, which can reduce the total memory footprint per node. This can matter for applications that are close to the memory capacity limit.

Another advantage is communication structure. With fewer MPI processes per node, there are fewer communicating entities. This can reduce MPI overheads and lead to communications that are more coarse grained. For example, instead of each of 64 cores sending its own small message, a single MPI process on the node can aggregate data from its threads and send a single larger message to other nodes.
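
One way this aggregation shows up in practice is a two-stage reduction: threads combine their contributions in shared memory first, and the process then issues a single MPI call. The sketch below assumes this pattern; the function name and arguments are placeholders.

```c
/* Sketch: threads reduce into one process-local value, so the process makes
 * a single MPI call instead of one call per core. */
#include <mpi.h>

double hybrid_sum(const double *local_data, int n) {
    double local_sum = 0.0;

    /* Stage 1: thread-level reduction inside the process (shared memory). */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < n; i++) {
        local_sum += local_data[i];
    }

    /* Stage 2: one collective per process combines the per-node results. */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    return global_sum;
}
```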

Hybrid programming can also better reflect hardware hierarchies. A node may have several NUMA domains or sockets, and each CPU core may support vector operations. By using processes to match NUMA domains and threads to exploit core parallelism, an application can align with the memory access characteristics of the machine in a more natural way.

Lastly, a hybrid design can offer more flexibility in scheduling. Threads can work on pools of local tasks, which can help with fine grained load balancing within a node, while MPI handles the coarse assignment of work between nodes.

Conceptual Design of Hybrid Codes

A hybrid program is easier to design if you first separate concerns conceptually. At the highest level, you decide which parts of your computation are handled by MPI and which parts are handled by threads.

A common approach is to perform the large scale partitioning of the problem with MPI. Different MPI ranks are responsible for distinct parts of the global domain or different subsets of tasks that require little sharing of data. Within each MPI process, the local work is further parallelized with threads, for instance by using OpenMP to parallelize loops over elements, grid points, particles, or matrix rows.

Another design pattern uses MPI to handle communication heavy or latency sensitive parts, while threads perform compute heavy work that fits well within cache and shared memory. For example, in a linear algebra code, MPI may orchestrate the distribution of blocks of a matrix between nodes, while threads perform the local matrix multiplications or factorization steps.

When designing such an application, it is useful to draw a hierarchy:

Global problem decomposition across MPI ranks.

Local data layout and loop structures within each rank.

Thread level work sharing and synchronization inside those loops.

Data structures are often organized so that each MPI rank owns a contiguous portion of global arrays. Threads then work on subsets of this local data without requiring inter rank communication. This hierarchy reduces the need for global synchronization and keeps most synchronization local to threads on the same node.
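
A minimal sketch of this ownership hierarchy, with placeholder names and a trivial per-point update standing in for real work, could look as follows: the rank computes the size of its contiguous slice, and its threads share the loop over that slice without any inter rank communication.

```c
/* Sketch of the ownership hierarchy: each rank owns a contiguous slice of a
 * conceptual global array, and its threads share the loop over that slice. */
#include <mpi.h>
#include <stdlib.h>

void update_local_portion(int global_n) {
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Coarse level: this rank owns roughly global_n / nranks entries. */
    int local_n = global_n / nranks + (rank < global_n % nranks ? 1 : 0);
    double *local = malloc((size_t)local_n * sizeof(double));

    /* Fine level: threads divide the rank-local loop among themselves;
     * no inter-rank communication happens inside this region. */
    #pragma omp parallel for
    for (int i = 0; i < local_n; i++) {
        local[i] = (double)i;   /* placeholder for the real per-point update */
    }

    free(local);
}
```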

Interaction between MPI and OpenMP

When MPI and OpenMP are used together, their interactions must be carefully considered. The MPI library and the OpenMP runtime share the same process, and their behaviors can influence each other.

One key aspect is thread safety in MPI. MPI libraries can support different levels of threading, which determine which threads are allowed to make MPI calls. The standard defines four such levels (MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE), usually requested at initialization with MPI_Init_thread. Applications that use OpenMP must decide whether they will call MPI only from a single thread or from multiple threads. This choice affects how you structure your parallel regions and where you place MPI calls in the code.
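
A typical initialization, sketched below, requests one of these levels and checks what the library actually granted before relying on it.

```c
/* Sketch: request a threading level at startup and verify what the MPI
 * library actually provides before relying on it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;

    /* The four levels, from most to least restrictive:
     *   MPI_THREAD_SINGLE     - the process has only one thread
     *   MPI_THREAD_FUNNELED   - only the main thread makes MPI calls
     *   MPI_THREAD_SERIALIZED - any thread may call MPI, but never concurrently
     *   MPI_THREAD_MULTIPLE   - any thread may call MPI at any time */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library provides insufficient thread support\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... hybrid computation ... */

    MPI_Finalize();
    return 0;
}
```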

Another important interaction is in the way the runtime uses CPU cores. The MPI library may create internal progress threads to handle communication, and the OpenMP runtime schedules threads onto cores. If both try to use the same cores without coordination, cores may be oversubscribed, and performance can degrade. Hybrid codes therefore need explicit control of thread affinity and process placement through environment variables and job scripts.

In a hybrid program, MPI calls often appear outside the most deeply nested OpenMP parallel regions, so MPI usually sees only one calling thread per process. In more advanced hybrid designs, some or all MPI communication is done from multiple threads, which requires careful use of MPI features that are safe in a multi threaded environment.
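
The funneled variant is often structured as in the sketch below: computation runs inside a parallel region, and only the master thread performs the exchange, with barriers separating the phases. The buffer, count, and neighbor arguments are placeholders for application data.

```c
/* Sketch of the funneled pattern: only the master thread calls MPI from
 * inside the parallel region; barriers separate computation and exchange. */
#include <mpi.h>

void compute_exchange_compute(double *halo, int count, int neighbor) {
    #pragma omp parallel
    {
        /* ... thread-parallel computation that produces the halo data ... */

        #pragma omp barrier   /* all threads have finished writing the halo */

        #pragma omp master    /* only the master thread talks to MPI */
        {
            MPI_Sendrecv_replace(halo, count, MPI_DOUBLE,
                                 neighbor, 0, neighbor, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        #pragma omp barrier   /* other threads wait for the received data */

        /* ... thread-parallel computation that uses the received data ... */
    }
}
```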

Synchronization behavior must also be considered. MPI collective operations imply synchronization across processes. OpenMP barriers synchronize threads. When both are nested, it is possible to create situations where threads inside a process wait at an OpenMP barrier while a single thread is inside an MPI collective, and processes wait for each other. The structure of parallel regions and MPI calls must be designed to avoid unintended interactions of this kind.

Typical Process and Thread Layouts

Hybrid programming introduces new configuration variables on top of the total number of cores or hardware threads. The choice of how many MPI processes per node and how many threads per process has significant performance implications.

One common layout is one MPI process per NUMA domain on each node, with threads spanning the cores inside that domain. For example, on a dual socket node with 32 cores per socket, you may run 2 MPI processes per node and 32 threads per process. This exploits fast memory access within each socket and reduces remote memory accesses across sockets.

Another layout is one MPI process per node, with all cores on the node used as threads. This can work well for codes that depend heavily on shared data structures accessed by all cores, provided the memory system behaves uniformly enough. However, on strongly NUMA nodes, this can introduce nonuniform memory access penalties if threads frequently access data allocated by other threads or bound to other sockets.

You can also choose multiple MPI processes per node, each with a smaller number of threads. For instance, 4 MPI processes per node with 16 threads each on a 64 core node. This can strike a balance between memory locality, communication granularity, and OpenMP scheduling overhead.

The key guideline is that MPI processes should be aligned with hardware boundaries that influence memory locality, such as sockets or NUMA domains, and threads should be aligned with cores within these domains. The optimal combination often depends on both the application characteristics and the specific cluster architecture.
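
A small diagnostic program can make the actual layout visible. The sketch below is Linux and glibc specific because it uses sched_getcpu; each thread reports the core it currently runs on, which helps verify that the chosen layout matches the intended hardware mapping.

```c
/* Sketch of a layout check (Linux/glibc specific because of sched_getcpu):
 * every thread reports the core it currently runs on. */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &namelen);

    #pragma omp parallel
    {
        printf("host %s  rank %d  thread %d  core %d\n",
               host, rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}
```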

Memory Locality and NUMA Considerations

In hybrid programming, memory locality becomes more prominent, because both MPI and threads can influence where memory is allocated and how it is accessed. On NUMA systems, memory attached to one socket has lower latency when accessed by cores on that socket than when accessed from another socket. Ignoring this can negate many of the potential benefits of hybrid programming.

An approach that often works well is to bind each MPI process to a specific NUMA domain and allocate the main data structures within that process. Threads inside the process then work only on data that is local to that NUMA domain. Because operating systems typically place a memory page in the NUMA domain of the thread that first touches it, codes can follow a first touch policy, where each thread initializes the portion of the array it will later use. This helps ensure that pages end up in the memory bank close to the thread that accesses them.
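
A minimal sketch of first touch initialization is shown below. It assumes static loop scheduling and pinned threads, so that the thread that initializes a chunk is the same thread that later computes on it.

```c
/* Sketch of first-touch initialization: the thread that will later use a
 * chunk of the array also initializes it, so the pages end up in that
 * thread's NUMA domain (assuming static scheduling and pinned threads). */
#include <stdlib.h>

double *allocate_with_first_touch(size_t n) {
    double *a = malloc(n * sizeof(double));

    /* Initialize in parallel with the same schedule as the later compute
     * loops, rather than serially from a single thread. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
    }
    return a;
}
```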

Hybrid codes must also consider false sharing and cache line contention among threads. When combining MPI and OpenMP, the desire to pack data tightly for efficient MPI communications can conflict with the desire to separate data accessed by different threads. Structure and padding of data types often need to be reexamined in a hybrid context.
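
One common way to avoid false sharing on frequently updated per-thread values is to pad them to a cache line, as in the sketch below. The 64-byte line size is an assumption, and the function and type names are placeholders; in practice an OpenMP reduction clause would usually be preferred, the explicit padded buffers simply make the layout issue visible.

```c
/* Sketch: per-thread accumulators padded to an assumed 64-byte cache line,
 * so threads updating their own slot do not invalidate each other's lines. */
#include <omp.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_double;

double padded_sum(const double *data, int n) {
    int max_threads = omp_get_max_threads();
    padded_double *partial = calloc((size_t)max_threads, sizeof(padded_double));

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        /* Each thread accumulates into its own cache-line-sized slot. */
        #pragma omp for
        for (int i = 0; i < n; i++) {
            partial[tid].value += data[i];
        }
    }

    /* Serial combination of the per-thread results. */
    double sum = 0.0;
    for (int t = 0; t < max_threads; t++) {
        sum += partial[t].value;
    }
    free(partial);
    return sum;
}
```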

Memory allocation strategies that work in pure MPI, such as allocating multiple small arrays per process, might need to be modified to reduce overheads and to avoid nonlocal memory accesses when there are multiple threads per process. Some hybrid codes use thread private buffers to avoid thread contention, while still sharing larger global data structures within the process.

Performance Challenges Unique to Hybrid Codes

Hybrid programming introduces performance challenges that do not appear, or are less severe, in single model parallel codes. The most obvious one is the complexity of tuning. There are more parameters to tune, such as the number of MPI ranks per node, the number of threads per rank, placement policies, and OpenMP scheduling options.

One frequent problem is oversubscription of cores. If the combination of MPI processes and threads is not matched to the available hardware, multiple threads or processes can end up running on the same core, which usually leads to poor performance. Careful configuration of the job submission, together with explicit settings for thread counts and affinity, can prevent this.

Another challenge is achieving good load balance at both levels. Even if the MPI level decomposition is well balanced, imbalances can still occur inside each process if the work sharing between threads is not well aligned with the data distribution. Conversely, excellent thread level balance cannot compensate for an MPI level imbalance where some ranks own substantially more work than others.

There is also the risk of increased latency due to extra synchronization. The use of OpenMP barriers, critical sections, and locks inside a process, combined with MPI synchronizations between processes, can create complex patterns of waiting. Tuning hybrid codes sometimes involves restructuring algorithms to reduce both thread level and process level synchronizations.

Communication and computation overlap can be harder to achieve in hybrid implementations. If MPI calls are serialized in one thread and OpenMP is used for computation, the overlap between communication and computation may be limited. More advanced patterns that use nonblocking MPI or thread based communication require careful design to reap benefits without introducing race conditions.
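
A common way to obtain some overlap, sketched below, is to post nonblocking halo exchanges first, let all threads update interior points that do not depend on the incoming data, and only then complete the communication. All names and the trivial update are placeholders.

```c
/* Sketch: nonblocking halo exchange posted first, interior work done by the
 * threads while the messages are in flight, completion before boundary work. */
#include <mpi.h>

void step_with_overlap(double *interior, int interior_n,
                       double *send_buf, double *recv_buf, int halo_n,
                       int left, int right) {
    MPI_Request reqs[2];

    /* Post the halo exchange first ... */
    MPI_Irecv(recv_buf, halo_n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, halo_n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... then update interior points, which do not depend on the incoming
     * halo data, with all threads while the messages are in flight. */
    #pragma omp parallel for
    for (int i = 0; i < interior_n; i++) {
        interior[i] *= 0.5;   /* placeholder for the real interior update */
    }

    /* Complete the exchange before any boundary work that needs recv_buf. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```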

Finally, debugging performance problems can be more difficult because slowdowns can originate in either MPI, OpenMP, or their interaction. Tools that measure and visualize performance at both levels become important in hybrid development.

When to Use Hybrid Parallel Programming

Hybrid programming is not an obligatory choice for all HPC applications. It is most useful when features of the problem and the machine make it attractive to combine multiple models.

Hybrid approaches are worth considering when a pure MPI implementation runs into scalability limits, such as excessive memory usage per rank or a large number of very small messages between ranks. Using fewer MPI processes and more threads can sometimes reduce these issues.

It is also valuable on systems with many cores per node and significant NUMA effects. Hybrid codes can assign work and data more carefully to nodes and sockets, which can yield better performance than a flat MPI approach that creates one rank per core without regard to memory hierarchy.

Some algorithms naturally have a hierarchical structure. For instance, multigrid methods, hierarchical matrices, and block structured grids all contain nested levels of decomposition. Hybrid programming can mirror this structure, mapping the coarse decomposition to MPI ranks and the finer decomposition to threads.

On the other hand, applications with simple data structures and regular communication patterns may perform very well with pure MPI, especially if they do not need extremely large core counts or do not approach memory capacity on nodes. In such cases, the extra complexity of hybrid programming may not be necessary.

The decision to adopt a hybrid model should be guided by the characteristics of the application, performance measurements, and the target architectures. For a new code intended to run at scale on modern clusters, it often makes sense to consider a hybrid design from the beginning, even if the first implementation uses only a subset of the parallel features.

Conceptual Patterns in Hybrid Codes

Hybrid codes tend to follow several recurring patterns that are useful to recognize conceptually. One pattern is coarse grain MPI with fine grain threading. In this pattern, each MPI process handles a relatively large piece of work, such as a subdomain or a block of a matrix, and threads divide loops inside this work into smaller iterations. This is common in finite difference and finite element solvers.

Another pattern is task based threading within MPI processes. MPI distributes large units of work, and threads within each process pull from a pool of smaller tasks. This can help with irregular workloads where different parts of the domain or different tasks have unpredictable cost. The MPI level ensures coarse load balance, and the thread level scheduling evens out finer imbalances.
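
A minimal sketch of this pattern within one MPI rank is shown below: a single thread creates OpenMP tasks for the rank-local items, and the whole team executes them. The item array and the trivial per-item work are placeholders.

```c
/* Sketch: within one MPI rank, threads pull finer-grained tasks created by a
 * single thread; useful when individual items have unpredictable cost. */
void process_local_items(double *items, int n_local) {
    #pragma omp parallel
    {
        /* One thread creates the tasks; the whole team executes them. */
        #pragma omp single
        {
            for (int i = 0; i < n_local; i++) {
                #pragma omp task firstprivate(i)
                {
                    items[i] = items[i] * items[i];  /* placeholder work */
                }
            }
        }   /* implicit barrier: all generated tasks complete before here */
    }
}
```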

A further pattern is communication and computation separation across threads. One or a small subset of threads focuses on handling MPI communications, while other threads continue computation. The goal is to overlap communication with computation. This pattern is more advanced and relies on careful handling of thread safety in MPI.
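
One possible shape of this pattern, assuming MPI_THREAD_FUNNELED and purely illustrative data and work division, is sketched below: the master thread drives the exchange while the remaining threads compute.

```c
/* Sketch: the master thread drives communication while the other threads
 * compute; MPI_THREAD_FUNNELED suffices because only the master thread
 * touches MPI. The hand-rolled work division is purely illustrative. */
#include <mpi.h>
#include <omp.h>

void overlap_comm_and_compute(double *halo, int halo_n, int neighbor,
                              double *work, int work_n) {
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            /* Communication thread: exchange halo data with the neighbor. */
            MPI_Sendrecv_replace(halo, halo_n, MPI_DOUBLE,
                                 neighbor, 0, neighbor, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* Worker threads: share the independent computation among
             * themselves while the message is in flight. */
            int nworkers = omp_get_num_threads() - 1;
            int wid = omp_get_thread_num() - 1;
            for (int i = wid; i < work_n; i += nworkers) {
                work[i] *= 0.5;   /* placeholder for the real computation */
            }
        }
    }   /* implicit barrier: communication and computation are both done */
}
```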

In all these patterns, there is a conceptual separation between responsibilities assigned to MPI ranks and responsibilities assigned to threads. Being explicit about this separation in the design phase, and documenting it in the code, helps keep hybrid programs understandable and maintainable.

In hybrid parallel programming, always maintain a clear separation of responsibilities between processes and threads, align the process and thread layout with the hardware hierarchy, and avoid oversubscribing cores. Poorly chosen combinations of MPI ranks and threads can lead to worse performance than either pure MPI or pure shared memory approaches.
