Overview of Hybrid Programming Patterns
Hybrid programming combines multiple parallel models—typically MPI between nodes and threads (OpenMP, CUDA, etc.) within nodes. Rather than using all possible combinations, most production codes rely on a small set of recurring patterns.
This chapter focuses on those common patterns: how they are structured conceptually, what they are good for, and what trade-offs they introduce. Detailed syntax for MPI, OpenMP, or CUDA is covered elsewhere; here we focus on design shapes.
Pattern 1: Pure MPI vs. Hybrid MPI + Threads
Before listing specific patterns, it helps to contrast the pure-MPI baseline with its hybrid counterpart:
- Pure MPI: One MPI process per core (or even per hardware thread). All parallelism is between processes.
- Hybrid MPI + threads: Fewer MPI processes per node, each process spawns multiple threads to use all cores.
Typical motivations for moving from pure MPI to hybrid:
- Reduce the number of MPI processes (and thus communication overhead).
- Exploit shared-memory efficiently within a node.
- Reduce memory usage by sharing data among threads inside a process.
Most patterns below are about how to divide work across MPI processes and threads in a sensible way.
Pattern 2: MPI Between Nodes, OpenMP Within Nodes
This is the canonical hybrid pattern for CPU-only systems.
Conceptual structure
- MPI level (inter-node):
- Each node runs one or a few MPI processes.
- MPI handles domain decomposition and data exchange between nodes.
- OpenMP level (intra-node):
- Each MPI process starts multiple threads to use the cores on its node.
- OpenMP parallelizes the “local” work inside the MPI process.
Conceptually:
- MPI decomposes the global problem into subdomains.
- Each process owns one or more subdomains.
- Within each process, OpenMP parallelizes loops or tasks on that subdomain.
Common variations
- 1 MPI process per node, many threads: Simple to reason about; can reduce memory use and communication, but threads spanning the whole node may suffer from NUMA effects and memory-bandwidth contention.
- 1 MPI process per socket, moderate threads per process: Balances memory locality (each socket has its own memory) with reduced MPI process counts.
- 1 MPI process per NUMA domain, few threads: Common on systems with multiple NUMA regions per node.
Where it’s useful
- Stencil computations on structured grids (e.g., CFD, PDE solvers).
- Linear algebra operations on sub-blocks of matrices.
- Particle codes where each rank owns a region of space and uses threads to process particles in that region.
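To make the two levels concrete, the following is a minimal hybrid “hello world” sketch in C (not taken from any particular code): MPI identifies the rank, and OpenMP identifies the thread within that rank. The thread count per rank is normally set at launch time, for example via OMP_NUM_THREADS.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, size;

    /* Request FUNNELED support: only the main thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* Each OpenMP thread reports its place in the two-level hierarchy. */
        printf("rank %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}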
Pattern 3: MPI + OpenMP Loop-Level Parallelism
In this pattern, MPI handles the high-level decomposition, and OpenMP is used only for simple loop-level parallelism inside each process.
Structure
- MPI decomposes the global data (e.g., block of a matrix, tile of a grid).
- Inside a computational kernel, OpenMP parallelizes loops over:
- Grid points
- Rows/columns of matrices
- Particles/objects in a local list
Pseudo-structure:
MPI_Init(...);
determine_subdomain(...);

for (int time_step = 0; time_step < T; ++time_step) {
    exchange_halos_with_MPI(...);

    #pragma omp parallel for
    for (int i = local_start; i < local_end; ++i) {
        // compute update on local data
    }
}

MPI_Finalize();
Characteristics
- Simple: Minimal code changes if serial loops are already present.
- Good starting point: Often used as the first hybrid step from pure MPI.
- Scalability limit: Does not exploit more advanced patterns like tasking or nested parallel regions.
Use cases
- Time-stepping solvers where each step involves straightforward loops with no complex dependencies.
- Codes that already have a stable MPI decomposition and want a modest performance boost on multi-core nodes.
Pattern 4: MPI + OpenMP Task-Based Parallelism
Here, MPI still manages the distributed decomposition, but within each process, OpenMP tasks are used instead of (or in addition to) simple parallel loops.
Structure
- MPI: same as before, each process owns a chunk of the global problem.
- Within each process:
- Work is organized as many relatively small tasks.
- OpenMP tasks are created and dynamically scheduled across threads.
- Dependencies can be expressed to preserve correctness.
Conceptual example:
// MPI sets up process-local data structures with multiple independent sub-problems
MPI_Setup(...);

#pragma omp parallel
{
    #pragma omp single
    {
        for (int k = 0; k < num_local_subproblems; ++k) {
            #pragma omp task
            {
                solve_subproblem(k);
            }
        }
    }
}

MPI_Exchange_Results(...);
Characteristics
- Flexible load balancing inside a process:
- Tasks can be scheduled dynamically so idle threads pick up extra work.
- Handles irregular workloads better than static loop parallelism.
- Requires more careful design to avoid overly fine-grained tasks, where task-creation overhead outweighs the useful work.
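One common way to keep tasks from becoming too fine-grained is a creation cutoff: a task is only spawned when the sub-problem is large enough to amortize the overhead, and small sub-problems run inline. A minimal sketch, assuming a hypothetical cost estimate estimate_work() provided by the application:
#include <omp.h>

/* Hypothetical helpers, assumed to exist in the application. */
extern double estimate_work(int k);      /* rough cost of sub-problem k */
extern void solve_subproblem(int k);

void solve_all(int num_local_subproblems, double cutoff) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < num_local_subproblems; ++k) {
        /* Small sub-problems run inline in the encountering thread;
           only sufficiently large ones become deferred tasks. */
        #pragma omp task if(estimate_work(k) > cutoff)
        solve_subproblem(k);
    }
}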
Typical applications
- Adaptive mesh refinement where some blocks require more work than others.
- Sparse linear algebra where row/column workloads differ widely.
- Multi-physics with independent or loosely coupled components per MPI rank.
Pattern 5: Domain Decomposition Hybrid (MPI Across, Threads Within)
Many simulation codes use geometric or topological domain decomposition. The hybrid pattern is:
- MPI level: Decompose the global physical domain into subdomains.
- Thread level: Further partition or parallelize work within each subdomain.
Structure
- Each MPI process owns a subdomain:
- Regular 1D/2D/3D blocks for structured grids.
- Collections of elements/cells for unstructured meshes.
- With OpenMP:
- Parallelize loops over cells/elements/edges.
- Possibly assign chunks of the subdomain to different threads.
Conceptual flow:
- MPI partition: assign blocks/elements to ranks.
- For each timestep:
- Exchange boundary data between MPI neighbors.
- Thread-level parallel update of interior cells.
- Possibly thread-parallel update of boundary cells.
Variants
- Block + thread: MPI ranks own non-overlapping blocks; threads operate over slices or tiles.
- Coloring + thread: Within a subdomain, use graph coloring or similar schemes to allow conflict-free threaded updates.
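A minimal sketch of the coloring variant, with an illustrative (assumed) coloring layout: elements of the same color share no data, so each per-color loop can be an ordinary parallel for, and the implicit barrier between colors preserves correctness.
#include <omp.h>

/* Illustrative (assumed) coloring layout: color c owns the elements
   element_ids[color_start[c] .. color_start[c+1] - 1]. */
typedef struct {
    int num_colors;
    int *color_start;
    int *element_ids;
} coloring_t;

extern void update_element(int element_id);   /* assumed application kernel */

void colored_update(const coloring_t *c) {
    for (int col = 0; col < c->num_colors; ++col) {
        /* Elements of one color never touch the same nodes/cells, so this
           loop has no write conflicts; the implicit barrier at its end
           separates colors. */
        #pragma omp parallel for
        for (int i = c->color_start[col]; i < c->color_start[col + 1]; ++i)
            update_element(c->element_ids[i]);
    }
}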
When it shines
- Grid-based PDE solvers.
- Finite element/finite volume methods.
- Any simulation with relatively regular decompositions, where the main work consists of nearest-neighbor or local operations.
Pattern 6: “Master MPI + Worker Threads” Within Each Rank
In this pattern, one thread per MPI process handles communication and orchestration, while other threads focus on computation.
Structure
- MPI process starts an OpenMP parallel region.
- Inside, designate:
- A master thread (or a small group) to:
- Perform MPI communication.
- Coordinate tasks.
- Manage global control flow.
- Worker threads to:
- Execute computational kernels.
- Possibly operate on tasks scheduled by the master.
Conceptual outline:
MPI_Init_thread(..., MPI_THREAD_FUNNELED, ...);
#pragma omp parallel
{
if (omp_get_thread_num() == 0) {
// master thread: perform MPI communication
manage_mpi_and_tasks();
} else {
// worker threads: perform computations
do_computation_work();
}
}
MPI_Finalize();Characteristics
- MPI calls are “funneled” through a designated thread, so the MPI library only needs MPI_THREAD_FUNNELED support rather than full thread safety (MPI_THREAD_MULTIPLE).
- Can overlap computation and communication:
- Master thread communicates while workers compute on already-available data.
- Requires careful coordination to avoid idle time and race conditions.
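The following sketch illustrates the overlap under MPI_THREAD_FUNNELED: thread 0 drives a nonblocking halo exchange with two neighbor ranks while the remaining threads update interior cells that do not depend on the halo. The buffer layout, neighbor ranks, and update kernels are assumptions for illustration, not a specific code.
#include <mpi.h>
#include <omp.h>

/* Assumed application kernels: interior cells need no halo data,
   boundary cells do. */
extern void update_interior(int tid, int nthreads);
extern void update_boundary(int tid, int nthreads);

/* send_buf/recv_buf each hold n values per neighbor: [0..n) for the left
   neighbor, [n..2n) for the right one. */
void step(double *send_buf, double *recv_buf, int n, int left, int right) {
    MPI_Request req[4];

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();

        if (tid == 0) {
            /* Funneled thread: the only one that touches MPI. */
            MPI_Irecv(recv_buf,     n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(recv_buf + n, n, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
            MPI_Isend(send_buf,     n, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
            MPI_Isend(send_buf + n, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        } else {
            update_interior(tid, nth);   /* overlaps with the communication */
        }

        #pragma omp barrier              /* halo data has now arrived in recv_buf */
        update_boundary(tid, nth);
    }
}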
Use cases
- Latency-sensitive codes that benefit from communication/computation overlap.
- Codes where MPI threading support is limited (e.g., only MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED).
Pattern 7: MPI + GPU + CPU Threads (Three-Level Hybrid)
On GPU-accelerated systems, a common hybrid pattern adds GPUs to the MPI + CPU threads model.
Structure
- MPI between nodes and often between GPUs:
- Typically one MPI rank per GPU or per subset of GPUs.
- Within each MPI process:
- CPU threads (OpenMP) to:
- Prepare data for GPU.
- Launch GPU kernels.
- Perform CPU-side computations.
- GPU(s) handle the heavy numerical kernels.
Conceptual layout:
- MPI partitions the global problem across ranks.
- Each rank:
- Owns one or more GPUs.
- Uses threads to manage multiple GPU streams or CPU-side work.
- Data movement:
- MPI → host buffers → GPU(s) and back.
Variants
- 1 MPI rank per GPU:
- Clean mapping: each rank talks to one GPU.
- Simpler code, but more MPI processes.
- 1 MPI rank per node, multiple GPUs:
- One MPI process controls several GPUs.
- OpenMP threads can manage separate GPUs or different streams on the same GPU.
Use cases
- Large-scale GPU-accelerated solvers (e.g., molecular dynamics, climate models).
- Applications that split work between CPU and GPU (e.g., GPU does dense math, CPU does control / sparse parts).
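As a concrete illustration of the “1 MPI rank per GPU” variant, here is a minimal sketch that assumes the CUDA runtime API: ranks sharing a node are identified with MPI_Comm_split_type (MPI-3), and each node-local rank binds to one device.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int world_rank, local_rank, num_devices;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Ranks on the same node share a communicator; the node-local rank
       decides which GPU this process drives. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);

    printf("rank %d uses GPU %d of %d on its node\n",
           world_rank, local_rank % num_devices, num_devices);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}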
Pattern 8: MPI + Threaded Libraries
Sometimes the application itself uses MPI only, but links to libraries that provide threaded parallelism internally.
Structure
- Application:
- Written as pure MPI (no explicit OpenMP pragmas in user code).
- Libraries:
- Use OpenMP or another threading model inside functions like dgemm, FFTs, or solver routines.
- The resulting execution:
- MPI processes run on multiple nodes.
- Inside each process, library calls spawn threads and use all cores.
Example pattern:
MPI_Init(...);
// Decompose data with MPI
// Local work uses threaded BLAS/LAPACK
dgemm(...); // internally uses OpenMP
MPI_Finalize();
Characteristics
- Minimal changes to application code.
- Relies on:
- Correctly setting thread counts in libraries (OMP_NUM_THREADS, library-specific environment variables).
- Avoiding over-subscription (too many threads per node due to multiple ranks each spawning many threads).
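A small sketch of one way to catch over-subscription at startup: compare ranks-per-node times threads-per-rank against the CPUs the node reports. It assumes MPI-3 (MPI_Comm_split_type) and a Linux-style sysconf(_SC_NPROCESSORS_ONLN); batch systems and pinning tools may constrain the picture further.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <unistd.h>

/* Call after MPI_Init; warns once per node if ranks x threads exceed CPUs. */
void warn_if_oversubscribed(void) {
    MPI_Comm node_comm;
    int ranks_on_node, node_rank;

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_size(node_comm, &ranks_on_node);
    MPI_Comm_rank(node_comm, &node_rank);

    int threads_per_rank = omp_get_max_threads();   /* honours OMP_NUM_THREADS */
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);      /* logical CPUs on this node */

    if (node_rank == 0 && (long)ranks_on_node * threads_per_rank > cpus)
        fprintf(stderr, "warning: %d ranks x %d threads > %ld CPUs on this node\n",
                ranks_on_node, threads_per_rank, cpus);

    MPI_Comm_free(&node_comm);
}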
Use cases
- Codes that are primarily composed of calls to well-optimized numerical libraries.
- Transitional phase before full explicit hybridization of user code.
Pattern 9: Checkerboard / Nested Decomposition
Some applications use multiple levels of decomposition both at the MPI and thread levels.
Structure
- MPI:
- Decompose into large, coarse-grained subdomains.
- Within each rank:
- Further decompose into tiles, blocks, or patches.
- Use threads to work over these tiles, often with cache-friendly blocking.
Conceptual view:
- MPI rank owns a large subdomain (e.g., a 3D block).
- It divides this into smaller tiles.
- OpenMP parallelizes over tiles or over loops within tiles.
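A minimal sketch of the thread-level tiling inside one rank's 2D block; the tile size and the per-point kernel update_point() are illustrative assumptions.
#include <omp.h>

#define TILE 64   /* illustrative tile edge length */

extern void update_point(int j, int i);   /* assumed per-point kernel */

/* The rank owns a local_ny x local_nx block; threads take whole tiles. */
void update_block(int local_ny, int local_nx) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (int jt = 0; jt < local_ny; jt += TILE) {
        for (int it = 0; it < local_nx; it += TILE) {
            /* One tile per loop iteration; a tile is sized to fit in cache. */
            for (int j = jt; j < jt + TILE && j < local_ny; ++j)
                for (int i = it; i < it + TILE && i < local_nx; ++i)
                    update_point(j, i);
        }
    }
}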
This is sometimes called:
- Nested decomposition (one decomposition applied on top of another).
- Checkerboard decomposition in structured grid contexts.
Benefits
- Improved cache locality and memory bandwidth usage.
- Better control over granularity:
- MPI handles only large chunks.
- Threads handle smaller tiles with less communication overhead.
Use cases
- Large structured-grid simulations (e.g., weather, ocean models).
- Applications where cache blocking is critical for performance.
Pattern 10: Hybrid Pipeline / Staged Processing
In some workflows, the computation naturally breaks into stages (e.g., read → preprocess → simulate → postprocess). A hybrid pattern can assign:
- MPI to distribute large data sets or simulation regions across nodes.
- Threads to process different stages or items in a pipeline fashion.
Structure
- Each MPI rank:
- Owns a subset of the data or simulation instances.
- Organizes work as a pipeline of stages.
- Threads:
- Work on different items at different stages simultaneously.
- Example:
- Thread 0 reads/prepares data.
- Thread 1 does main computation.
- Thread 2 writes results.
Conceptual idea:
- Exploit concurrency between stages (stage parallelism) in addition to data parallelism between MPI ranks.
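One possible realization (not the only one) of stage parallelism within a rank uses OpenMP task dependencies: item i flows read → compute → write, and different items can occupy different stages at the same time. The stage functions below are placeholders.
#include <omp.h>
#include <stdlib.h>

/* Placeholder stage functions, assumed to exist in the application. */
extern void read_and_prepare(int i);
extern void compute_item(int i);
extern void write_results(int i);

void run_pipeline(int num_items) {
    /* One byte per item acts purely as a dependence token. */
    char *prepared = calloc(num_items, 1);
    char *computed = calloc(num_items, 1);

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < num_items; ++i) {
            #pragma omp task depend(out: prepared[i])
            read_and_prepare(i);

            #pragma omp task depend(in: prepared[i]) depend(out: computed[i])
            compute_item(i);

            #pragma omp task depend(in: computed[i])
            write_results(i);
        }
    }   /* all tasks complete at the barrier that ends the parallel region */

    free(prepared);
    free(computed);
}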
Use cases
- Data-intensive workflows with I/O, preprocessing, simulation, and analysis.
- Ensemble simulations where many independent runs are processed through the same pipeline.
Pattern 11: Ensemble or Replicated Simulations (MPI for Many, Threads Within Each)
Some workloads consist of many independent or weakly coupled simulations (ensembles, parameter sweeps, Monte Carlo). A hybrid pattern is:
- MPI:
- Distribute independent simulations across nodes/ranks.
- Threads within each rank:
- Parallelize the simulation itself (e.g., particles, grid cells).
- Or process multiple small instances in parallel on a node.
Structure
- MPI rank-level:
- Each rank is responsible for one or more ensemble members.
- OpenMP:
- Parallelize each member’s computation.
- Alternatively, parallelize over multiple ensemble members held by a single rank.
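A minimal sketch of the second mapping, where threads parallelize over the members a rank holds; run_member() is an assumed placeholder for one independent simulation. In the first mapping, the member loop would instead run serially on each rank and run_member() itself would be OpenMP-parallel.
#include <mpi.h>
#include <omp.h>

extern void run_member(int member_id);   /* assumed: one independent simulation */

void run_ensemble(int num_members) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Members are dealt out round-robin across ranks; dynamic scheduling
       helps when members have uneven cost. */
    #pragma omp parallel for schedule(dynamic)
    for (int m = rank; m < num_members; m += size)
        run_member(m);
}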
Benefits
- High throughput for parameter studies and uncertainty quantification.
- Flexible mapping:
- Use threads to adapt to varying simulation sizes; some ensemble members may be smaller or larger than others.
Choosing a Hybrid Pattern
Selecting a pattern is primarily driven by:
- Application structure:
- Regular grids → domain decomposition hybrid.
- Irregular, dynamic workloads → task-based hybrids.
- Library-heavy codes → MPI + threaded libraries.
- Hardware:
- Multi-socket CPUs with NUMA → 1 rank per socket, threads within.
- GPU nodes → MPI + GPU + CPU threads.
- Software constraints:
- MPI thread support level.
- Existing code structure and maintainability considerations.
In practice, production codes often combine several patterns. For example:
- MPI domain decomposition + OpenMP loop parallelism + threaded BLAS.
- MPI + GPU, plus CPU threads for data preparation and auxiliary tasks.
Understanding these patterns helps you read, design, and reason about hybrid codes, rather than treating MPI, OpenMP, and accelerators as separate, unrelated tools.