Motivation: Why Hybrid Parallel Programming?
Hybrid parallel programming combines multiple parallel models—most commonly MPI for distributed-memory parallelism and OpenMP or GPU programming models for shared-memory or accelerator parallelism—within a single application.
In modern HPC systems, each node typically has:
- Multiple NUMA regions (sockets) and many cores per socket
- Vector units on each core
- Often one or more GPUs or other accelerators
- A high-performance network connecting nodes
Using only a single parallel model (e.g., MPI everywhere) can leave parts of this hardware underutilized or create bottlenecks. Hybrid programming aims to:
- Exploit multiple levels of parallelism
  - Across nodes (distributed memory, often via MPI)
  - Within nodes (shared memory, often via OpenMP or threads)
  - Within cores (vectorization, handled mostly by the compiler and libraries)
  - On accelerators (CUDA, OpenACC, etc., when combined)
- Reduce communication overhead
  - Fewer MPI processes per node can mean fewer inter-process messages and smaller MPI metadata overhead.
  - Intra-node work sharing can happen via threads or GPU kernels instead of explicit message passing.
- Use memory more efficiently
  - Threads or GPU kernels on a node can share memory and data structures, potentially reducing duplication compared to a pure-MPI approach.
  - Fewer MPI ranks may reduce the memory used per rank (buffers, halo regions, ghost cells, etc.).
- Adapt to complex hardware topologies
  - Hybrid models can respect socket/NUMA boundaries, core counts, and GPU placement more naturally than a single flat model.
Hybrid parallel programming is especially common in:
- Large-scale PDE solvers and CFD codes
- Climate and weather models
- Molecular dynamics
- Linear algebra and eigenvalue solvers
- Applications designed to run on leadership-class supercomputers
Common Hybrid Combinations
The most common hybrid combinations in practice include:
- MPI + OpenMP
  - MPI between nodes, OpenMP within each node (or within each socket).
  - Often considered the “standard” hybrid model on CPU-only clusters.
- MPI + CUDA / OpenACC / HIP
  - MPI between nodes and between GPUs, with each MPI process controlling one or more GPUs.
  - GPU kernels handle the fine-grained parallelism.
- MPI + OpenMP + GPU
  - A three-level hybrid: MPI across nodes, OpenMP across CPU cores, and GPUs for offload from each CPU process or thread.
  - Common in codes incrementally ported to GPUs.
- Hybrid with task-based runtimes
  - MPI combined with tasking frameworks (e.g., OpenMP tasks, Kokkos, Charm++, HPX, or other runtimes).
  - More advanced and less common for beginners, but increasingly relevant.
Design Considerations for Hybrid Models
Designing a hybrid application requires architectural decisions that do not appear in single-model codes. Some key aspects:
Choosing MPI Process and Thread Counts
Typical strategies on a node with $N_{\text{cores}}$ cores:
- One MPI process per node
  - Use OpenMP threads across all cores on the node.
  - Pros: minimal number of MPI ranks, potentially simpler MPI patterns.
  - Cons: may suffer from NUMA effects and contention if not carefully tuned.
- One MPI process per socket
  - Each MPI process spawns OpenMP threads bound to the cores in that socket.
  - Often a good compromise between locality and simplicity.
- One MPI process per NUMA region
  - More fine-grained than per-socket if the node has multiple NUMA domains per socket.
  - Helps preserve memory locality.
- One MPI process per GPU
  - Common on GPU-based systems: each rank drives one GPU, optionally with threads per rank.
Selecting the right balance is a performance-tuning decision and can vary per machine and application.
Work Decomposition Across Levels
In a hybrid code, you must decide what kind of workload each level of parallelism handles:
- MPI level (coarse-grained, distributed memory)
  - Typically handles domain decomposition over large units (e.g., blocks of a grid, subsets of particles, matrix blocks).
  - Communicates halo regions, global reductions, and global data distributions.
- Thread level (shared memory)
  - Handles loop parallelization, local data processing, and fine-grained tasks inside each MPI subdomain.
  - Helps hide the latency of MPI communication via overlap (if designed carefully).
- GPU/accelerator level
  - Offloads compute-intensive kernels where high throughput is beneficial.
  - Often used inside each MPI rank as a “sub-accelerator” for local work.
Multi-level decomposition should minimize:
- Load imbalance between MPI ranks
- Load imbalance between threads or GPU kernels within a rank
- Redundant computation or communication
Programming Models for Hybrid CPU-Only Codes
Although many variations are possible, the most established hybrid CPU-only pattern is MPI + OpenMP.
Basic MPI + OpenMP Structure
At a high level, an MPI + OpenMP application looks like:
```c
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Initialize local data based on rank
    // ...

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        // Thread-local work here, typically indexed by tid
    }

    // MPI communication may happen between parallel regions,
    // or even inside them if MPI is used with thread support
    MPI_Finalize();
    return 0;
}
```

The hybrid aspects to pay special attention to include:
- MPI initialization for threading
  - Use MPI_Init_thread instead of MPI_Init when multiple threads might call MPI.
  - Request an appropriate thread support level (e.g., MPI_THREAD_FUNNELED or MPI_THREAD_MULTIPLE).
- Placement and binding
  - Use OpenMP environment variables and/or runtime APIs to control how threads are bound to cores.
  - Align MPI rank placement with NUMA regions/sockets when launching jobs with the scheduler.
Details of MPI and OpenMP themselves belong to other chapters; here, the focus is how they interact and are combined.
MPI Threading Levels
When combining MPI and threads, you must consider whether threads will call MPI routines:
- MPI_THREAD_SINGLE
  - Only one thread exists in the process (no threading).
- MPI_THREAD_FUNNELED
  - Multiple threads may exist, but only the main thread (typically the one that called MPI_Init_thread) may call MPI.
  - Common and relatively cheap to support.
- MPI_THREAD_SERIALIZED
  - Multiple threads may call MPI, but not at the same time; the application is responsible for serializing access.
  - More flexible, with slightly more overhead.
- MPI_THREAD_MULTIPLE
  - Multiple threads may call MPI routines concurrently.
  - Most flexible, but can have the highest overhead, depending on the implementation.
Typical hybrid designs try to use MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED to avoid the extra overhead of MPI_THREAD_MULTIPLE, unless there is a clear need for fully concurrent MPI calls from multiple threads.
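For example, a minimal sketch of requesting MPI_THREAD_FUNNELED and checking what the library actually provides could look like this (error handling kept deliberately simple):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int provided;

    // Request FUNNELED: only the thread that calls MPI_Init_thread
    // will make MPI calls later on.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    // The library may grant a lower level than requested, so check it.
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not provide MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    // ... hybrid MPI + OpenMP work ...

    MPI_Finalize();
    return 0;
}
```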
OpenMP Usage Patterns with MPI
Some common patterns of mixing MPI and OpenMP include:
- MPI outside, OpenMP inside
  - Outer loops over time steps or global stages use MPI for communication between subdomains.
  - Inner loops over local data are parallelized with OpenMP.
- MPI communication in the master thread only
  - #pragma omp parallel regions in which only one thread (e.g., via omp master or omp single) performs MPI calls while the others compute.
  - Suitable for MPI_THREAD_FUNNELED.
- Overlap of communication and computation (see the sketch after this list)
  - Use nonblocking MPI calls from one thread to initiate communication.
  - Other threads perform computation on independent data while communication is in progress.
  - Requires careful design to avoid race conditions and ensure data dependencies are respected.
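The overlap pattern can be sketched as follows. This is only an illustration: it assumes MPI was initialized with at least MPI_THREAD_FUNNELED, and compute_interior, compute_boundary, and the halo buffers are placeholders for the application's real data and kernels.

```c
#include <mpi.h>
#include <omp.h>

// Hypothetical helpers standing in for the real computation.
void compute_interior(double* u, int i);
void compute_boundary(double* u, const double* recv_halo, int i);

// Sketch: the master thread drives nonblocking halo exchange while the
// other threads compute on interior data that needs no halo values.
void timestep(double* u, double* send_halo, double* recv_halo,
              int halo_count, int n_interior, int n_boundary, int neighbor)
{
    MPI_Request reqs[2];   // shared by all threads in the parallel region

    #pragma omp parallel
    {
        #pragma omp master
        {
            MPI_Irecv(recv_halo, halo_count, MPI_DOUBLE, neighbor, 0,
                      MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(send_halo, halo_count, MPI_DOUBLE, neighbor, 0,
                      MPI_COMM_WORLD, &reqs[1]);
        }
        // 'master' has no implied barrier: workers start computing at once.

        #pragma omp for
        for (int i = 0; i < n_interior; ++i)
            compute_interior(u, i);            // halo-independent work

        #pragma omp master
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        #pragma omp barrier                    // halos now valid for all threads

        #pragma omp for
        for (int i = 0; i < n_boundary; ++i)
            compute_boundary(u, recv_halo, i); // needs received halo data
    }
}
```

Whether communication actually progresses during the interior loop depends on the MPI implementation's asynchronous progress, so measuring the achieved overlap is essential.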
Programming Models for Hybrid CPU–GPU Codes
Hybrid CPU–GPU codes often combine MPI with a GPU programming model:
- MPI + CUDA
- MPI + OpenACC
- MPI + vendor-specific GPU frameworks (e.g., HIP, SYCL in some setups)
Rank-to-GPU Mapping
A crucial design decision is how MPI ranks map to GPUs:
- One MPI rank per GPU (most common)
  - Each rank calls GPU kernels on its own GPU(s).
  - Threading is either minimal or used for CPU-side tasks (e.g., OpenMP for host computations).
- Multiple ranks per GPU
  - Less common; can cause contention and complexity.
  - Sometimes used to match a legacy MPI-only code structure before porting pieces to the GPU.
- One rank controlling multiple GPUs
  - Used on some multi-GPU nodes when a single process orchestrates multiple devices.
  - More complex to program and load-balance.
The mapping choice has implications for:
- MPI communicator layouts (e.g., communicators per node, per GPU)
- Load balancing and domain decomposition
- Use of GPU-direct or GPU-aware MPI features
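A common way to realize the one-rank-per-GPU mapping is to derive a node-local rank from a shared-memory communicator and use it to pick a device. Below is a minimal sketch assuming CUDA; error checking is omitted and the round-robin assignment is just one possible policy.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Communicator containing only the ranks on this node.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    // Bind this rank to one of the node's GPUs
    // (round-robin if there are more ranks than devices).
    int n_devices;
    cudaGetDeviceCount(&n_devices);
    cudaSetDevice(local_rank % n_devices);

    // ... allocate device memory, launch kernels, exchange halos ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```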
GPU-Aware MPI
On GPU clusters, many MPI implementations support:
- Direct sending and receiving of device memory buffers (GPU-aware MPI)
- Peer-to-peer device transfers
- NVLink or similar high-bandwidth interconnects
Hybrid designs can:
- Use MPI to exchange data directly between GPUs across nodes.
- Overlap GPU kernels with MPI communication, often via CUDA streams and nonblocking MPI.
Exact APIs and performance details depend on the specific hardware and MPI implementation, but the hybrid concept is:
- Decompose work across ranks (MPI).
- Within each rank, keep data resident on the GPU as much as possible.
- Minimize transfers between host and device and between GPUs.
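Assuming the MPI library is GPU-aware, device buffers can be handed directly to MPI calls. The sketch below shows a halo exchange on device memory; buffer names, neighbor ranks, and tags are illustrative, and with a non-GPU-aware MPI the data would instead have to be staged through host buffers.

```c
#include <mpi.h>

// Sketch: halo exchange directly on device buffers, assuming a
// GPU-aware MPI implementation. The d_* pointers are device memory
// (e.g., from cudaMalloc); 'left' and 'right' are neighbor ranks.
void exchange_halos(double* d_send_left,  double* d_recv_left,
                    double* d_send_right, double* d_recv_right,
                    int halo_count, int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[4];

    MPI_Irecv(d_recv_left,  halo_count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(d_recv_right, halo_count, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(d_send_left,  halo_count, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(d_send_right, halo_count, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}
```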
Node-Level Parallelism in Hybrid Codes
Node-level parallelism concerns how you exploit all the resources on a single compute node using threads and/or accelerators.
CPU Node-Level Parallelism
On CPU-only nodes, the primary concerns are:
- Core utilization
  - Assign threads so that each hardware core is used effectively.
  - Avoid oversubscription (more runnable threads than cores without good reason).
- NUMA awareness
  - Ensure threads mostly access memory local to their NUMA region.
  - Align data placement with the NUMA locality of the MPI rank or thread group.
- Synchronization strategy
  - Use OpenMP constructs that match your workload (e.g., parallel for, sections, tasks).
  - Minimize synchronization overheads inside hot regions.
Hybrid codes often:
- Create one MPI rank per socket or NUMA region.
- Spawn an OpenMP team restricted to that locality.
- Use first-touch memory allocation strategies to lay out data correctly in memory.
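First-touch placement means a memory page is physically allocated on the NUMA node of the thread that first writes to it. A minimal sketch, assuming threads are pinned (e.g., via OMP_PROC_BIND) and the same static schedule is used later in the compute loops:

```c
#include <stdlib.h>
#include <omp.h>

// Allocate and initialize an array so that each page is first touched
// by the thread that will later operate on it (first-touch placement).
double* alloc_first_touch(size_t n)
{
    double* a = malloc(n * sizeof *a);   // pages not yet physically placed

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        a[i] = 0.0;                      // first write places the page locally

    return a;
}
```

If the initialization loop and the later compute loops use different schedules or thread counts, pages can still end up on the wrong NUMA node despite first-touch.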
GPU Node-Level Parallelism
On GPU nodes, node-level parallelism includes:
- Multiple GPUs per node
  - Multiple MPI ranks, each bound to a distinct GPU, or a single rank controlling multiple GPUs.
  - Use CUDA streams or similar mechanisms to overlap transfers and compute.
- Interaction between CPUs and GPUs
  - CPUs handle MPI communication and host-side logic.
  - GPUs handle compute-intensive kernels.
  - Combine CPU parallelism (threads) and GPU parallelism (kernels and blocks) carefully to avoid idle periods.
For beginners, a typical stepwise path is:
- Start with an MPI-only code.
- Introduce GPU offload per MPI rank.
- Optionally introduce CPU threading on top if beneficial.
Cluster-Level Parallelism in Hybrid Codes
Cluster-level parallelism is primarily managed via distributed-memory mechanisms (e.g., MPI). In a hybrid code, this level should:
- Align with node architecture
  - Use node-level topology (sockets, GPUs) to guide rank distribution across the cluster.
  - Request appropriate resources in job scripts (e.g., --nodes, --ntasks-per-node, --gpus-per-node in SLURM).
- Use communicators for subgroups
  - Create MPI communicators that group ranks by node, by socket, by GPU, etc.
  - Facilitate node-local operations (e.g., shared memory windows, reduction operations) separate from inter-node ones (see the sketch after this list).
- Balance the global workload
  - Ensure each node’s total workload (across all its ranks and threads/GPUs) is approximately equal.
  - Avoid hot spots where some nodes are overloaded while others are underutilized.
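As an illustration of node-aware communicators, the sketch below builds a node-local communicator plus a communicator containing one leader rank per node, then performs a two-stage sum (node-local first, then across nodes). The structure and the names node_comm and leader_comm are illustrative, not a prescribed API.

```c
#include <mpi.h>

// Two-stage reduction: sum within each node, then across node leaders.
double hierarchical_sum(double local_value)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;

    // Ranks sharing a node.
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    // One communicator containing rank 0 of every node (the "leaders");
    // the remaining ranks land in a communicator they simply do not use here.
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : 1, 0, &leader_comm);

    double node_sum = 0.0, global_sum = 0.0;
    MPI_Reduce(&local_value, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    if (node_rank == 0)
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, leader_comm);

    // Broadcast the result back to every rank on the node.
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);

    MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    return global_sum;
}
```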
Hybrid designs let you craft different strategies for:
- Inter-node communication patterns (e.g., halo exchange only between neighbors).
- Intra-node work/failure domains (e.g., isolating failures or stragglers).
Common Hybrid Programming Patterns
Several recurring design patterns appear across many hybrid applications. Recognizing them helps both in reading existing codes and designing new ones.
Pattern 1: MPI Domains + OpenMP Loop Parallelism
Idea: Each MPI rank owns a large subdomain; within that subdomain, OpenMP parallelizes inner loops.
- Typical in finite-difference or finite-volume PDE codes.
- MPI handles halo exchange at the boundaries of subdomains.
- OpenMP uses parallel for directives over spatial or temporal loops.
Characteristics:
- Easy to retrofit into existing MPI codes.
- Works well if each subdomain is large enough to keep threads busy.
- Thread synchronization overhead is minimal when loops are long and regular.
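A rough sketch of this pattern for a 1D decomposition follows; the update function, the single ghost cell per side, and the neighbor ranks left/right are illustrative assumptions rather than a specific production code.

```c
#include <mpi.h>
#include <omp.h>

// Pattern 1 sketch: MPI halo exchange for a 1D subdomain, followed by an
// OpenMP-parallel Jacobi-style update. u and u_new have n_local + 2 entries:
// u[0] and u[n_local + 1] are ghost cells filled from the neighbors.
void update(double* u, double* u_new, int n_local, int left, int right)
{
    MPI_Request reqs[4];

    // Exchange boundary values with the two neighbors
    // (MPI_PROC_NULL can be used at physical boundaries).
    MPI_Irecv(&u[0],           1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&u[n_local + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&u[1],           1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&u[n_local],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    // OpenMP parallelizes the local loop inside the MPI subdomain.
    #pragma omp parallel for schedule(static)
    for (int i = 1; i <= n_local; ++i)
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);
}
```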
Pattern 2: MPI with Thread-Parallel Tasks
Idea: MPI for domain decomposition; OpenMP or another threading library to handle irregular or task-based parallelism within each rank.
- Suitable for adaptive mesh refinement (AMR), tree codes, or irregular graphs.
- Use OpenMP tasks or similar to distribute work over uneven data structures.
Characteristics:
- More complex to debug and tune.
- Can yield better load balance inside each rank than simple loop parallelism.
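A hedged sketch of the idea within one rank, using OpenMP tasks over a hypothetical binary tree (the node type and process_leaf are placeholders for the application's real data structure and work):

```c
#include <omp.h>

// Hypothetical tree node for an irregular data structure.
typedef struct node {
    struct node* left;
    struct node* right;
    /* payload ... */
} node;

void process_leaf(node* n);   // placeholder for the real per-leaf work

// Each subtree becomes an OpenMP task, so uneven trees are balanced
// dynamically across the threads of this MPI rank.
void traverse(node* n)
{
    if (n == 0) return;
    if (n->left == 0 && n->right == 0) {
        process_leaf(n);
        return;
    }
    #pragma omp task
    traverse(n->left);
    #pragma omp task
    traverse(n->right);
    #pragma omp taskwait
}

void traverse_tree(node* root)
{
    #pragma omp parallel
    {
        #pragma omp single      // one thread seeds the task tree
        traverse(root);
    }
}
```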
Pattern 3: MPI Rank per GPU + GPU Kernels
Idea: One MPI rank per GPU, each managing a subset of the global data, offloaded to the device.
- MPI used for halo exchanges and global reductions.
- Most computational kernels run on the GPU; the CPU runs MPI and host control.
- Sometimes combined with OpenMP on the host to overlap tasks.
Characteristics:
- Common in modern GPU-accelerated simulations.
- Data movement between host and device must be carefully managed.
- Often uses GPU-aware MPI to minimize overhead.
Pattern 4: Master–Worker Within a Node
Idea: A node-level “master” thread performs MPI communication and management, while worker threads handle computation.
- Aligns with MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED.
- Can be implemented via OpenMP’s single construct or via explicit threading.
Characteristics:
- Simplifies MPI usage by funneling it through a single control point.
- Facilitates communication–computation overlap if workers continue compute while the master manages messages.
Pattern 5: Hierarchical Decomposition
Idea: Parallelism is split over multiple hierarchical levels:
- MPI over nodes
- OpenMP across sockets within a node
- Vectorization inside loops
- Optional GPU-level parallelism
Characteristics:
- Matches hardware hierarchy closely.
- Potentially complex, but can offer strong scalability and efficiency.
Challenges and Pitfalls in Hybrid Programming
Hybrid approaches bring additional complexity beyond single-model parallelism. Some typical issues:
Increased Complexity and Maintenance Cost
- More APIs to understand and use correctly.
- Harder debugging: errors can arise from interactions between MPI and threads or GPUs.
- Longer development and testing time.
For many applications, a simple model (e.g., pure MPI or MPI + GPU) might be sufficient, and hybrid complexity must be justified by real performance gains.
Load Imbalance Across Levels
- Even if MPI ranks are balanced, threads inside a rank might not be.
- GPU kernels must also be balanced to avoid idle devices.
- Amdahl’s Law still applies: un-parallelized or poorly balanced parts can dominate runtime.
Hybrid codes require performance analysis at multiple levels:
- Between nodes
- Within nodes
- Within devices
NUMA and Memory Locality Issues
Hybrid CPU codes can suffer from:
- Threads accessing memory on remote NUMA nodes.
- Inappropriate binding of MPI ranks and threads to cores.
Symptoms include:
- Good scaling up to a few cores, then flattening or slowdowns.
- Performance very sensitive to OMP_PROC_BIND, numactl, or job scheduler binding options.
Thread-Safety and Race Conditions with MPI
When multiple threads may interact with MPI:
- Misuse of MPI_THREAD_MULTIPLE or incorrect assumptions about MPI thread safety can lead to race conditions.
- Some MPI libraries may have limited or costly support for high thread levels.
Practical advice:
- Restrict MPI calls to one thread when possible (FUNNELED pattern).
- If multiple threads must call MPI, ensure rigorous synchronization.
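Under MPI_THREAD_SERIALIZED, one simple (if blunt) way to honor the “not at the same time” rule inside a parallel region is a named critical section around every MPI call. The helper below is only a sketch, and the chunked send pattern is hypothetical:

```c
#include <mpi.h>
#include <omp.h>

// Sketch: serialize MPI calls made from inside a parallel region.
// Suitable for MPI_THREAD_SERIALIZED, where threads may call MPI,
// but never concurrently.
void threaded_send(const double* chunks, int chunk_len,
                   int n_chunks, int dest)
{
    #pragma omp parallel for
    for (int c = 0; c < n_chunks; ++c) {
        // ... per-thread computation on chunk c ...

        #pragma omp critical(mpi_calls)   // one thread in MPI at a time
        MPI_Send(&chunks[c * chunk_len], chunk_len, MPI_DOUBLE,
                 dest, c, MPI_COMM_WORLD);
    }
}
```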
Debugging and Profiling Complexity
- Tools may handle MPI and threads separately; combining them can require specialized MPI+OpenMP/GPU profilers.
- Timestamps and event ordering across ranks and threads can become difficult to interpret.
A structured approach to performance analysis—profiling at each level separately, then combined—is essential.
When (and When Not) to Use Hybrid Programming
Hybrid parallel programming is not always the right choice. Situations where it makes sense include:
- Memory constraints
  - Pure MPI would require too many ranks per node, exceeding memory or hitting OS limits (e.g., file handles, MPI buffers).
- Scaling limits with pure MPI
  - Adding more MPI ranks no longer improves performance, but many cores are idle.
  - Threading within each rank can help exploit additional cores.
- Need for GPU acceleration
  - Modern leadership systems use GPUs or other accelerators; hybrid models are often required to use them effectively.
- Clear hierarchical structure in the algorithm
  - Natural decomposition into coarse tasks for MPI and fine-grained loops or tasks for OpenMP or GPU kernels.
Hybrid programming may not be necessary if:
- The problem size is small; single-node or single-model parallelism is enough.
- The code is simple and must remain very easy to maintain.
- Effective performance can be achieved with a simpler pattern (e.g., pure MPI+GPU with no CPU threading).
Practical Getting-Started Strategy
For absolute beginners, a practical roadmap to hybrid codes might be:
- Start with a correct, reasonably efficient serial code.
- Add MPI for domain decomposition across multiple nodes.
- Introduce OpenMP inside each MPI rank to parallelize the most expensive loops or tasks.
- Measure performance at each step to confirm actual benefits.
- (Optional) Add GPU support:
  - Offload critical kernels while keeping the MPI+OpenMP structure on the host.
  - Gradually move more computation to the GPU.
- Iterate on placement and scaling:
  - Experiment with different rank/thread configurations, affinities, and resource requests in job scripts.
  - Evaluate strong and weak scaling behavior.
Focusing on a small set of clear hybrid patterns and gradually refining them is typically more productive than trying to apply every available technique at once.