Overview of Hybrid Programming Patterns
Hybrid programming combines multiple parallel models—typically MPI between nodes and threads (OpenMP, CUDA, etc.) within nodes. Rather than using all possible combinations, most production codes rely on a small set of recurring patterns.
This chapter focuses on those common patterns: how they are structured conceptually, what they are good for, and what trade-offs they introduce. Detailed syntax for MPI, OpenMP, or CUDA is covered elsewhere; here we focus on design shapes.
Pattern 1: Pure MPI vs. Hybrid MPI + Threads
Before listing specific patterns, it helps to contrast the pure-MPI baseline with its hybrid counterpart:
- Pure MPI: One MPI process per core (or even per hardware thread). All parallelism is between processes.
- Hybrid MPI + threads: Fewer MPI processes per node, each process spawns multiple threads to use all cores.
Typical motivations for moving from pure MPI to hybrid:
- Reduce the number of MPI processes (and thus communication overhead).
- Exploit shared-memory efficiently within a node.
- Reduce memory usage by sharing data among threads inside a process.
Most patterns below are about how to divide work across MPI processes and threads in a sensible way.
Pattern 2: MPI Between Nodes, OpenMP Within Nodes
This is the canonical hybrid pattern for CPU-only systems.
Conceptual structure
- MPI level (inter-node):
- Each node runs one or a few MPI processes.
- MPI handles domain decomposition and data exchange between nodes.
- OpenMP level (intra-node):
- Each MPI process starts multiple threads to use the cores on its node.
- OpenMP parallelizes the “local” work inside the MPI process.
Conceptually:
- MPI decomposes the global problem into subdomains.
- Each process owns one or more subdomains.
- Within each process, OpenMP parallelizes loops or tasks on that subdomain.
Common variations
- 1 MPI process per node, many threads: Simple to reason about; can reduce memory use and communication, but threads spanning the whole node may suffer from NUMA effects and memory-bandwidth contention.
- 1 MPI process per socket, moderate threads per process: Balances memory locality (each socket has its own memory) with reduced MPI process counts.
- 1 MPI process per NUMA domain, few threads: Common on systems with multiple NUMA regions per node.
Where it’s useful
- Stencil computations on structured grids (e.g., CFD, PDE solvers).
- Linear algebra operations on sub-blocks of matrices.
- Particle codes where each rank owns a region of space and uses threads to process particles in that region.
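To make the two levels concrete, the following is a minimal hybrid “hello world” sketch in C (not taken from any particular code): MPI identifies the rank, and OpenMP identifies the thread within that rank. The thread count per rank is normally set at launch time, for example via OMP_NUM_THREADS.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, size;

    /* Request FUNNELED support: only the main thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* Each OpenMP thread reports its place in the two-level hierarchy. */
        printf("rank %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}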
Pattern 3: MPI + OpenMP Loop-Level Parallelism
In this pattern, MPI handles the high-level decomposition, and OpenMP is used only for simple loop-level parallelism inside each process.
Structure
- MPI decomposes the global data (e.g., block of a matrix, tile of a grid).
- Inside a computational kernel, OpenMP parallelizes loops over:
- Grid points
- Rows/columns of matrices
- Particles/objects in a local list
Pseudo-structure:
MPI_Init(...);
determine_subdomain(...);

for (int time_step = 0; time_step < T; ++time_step) {
    exchange_halos_with_MPI(...);

    #pragma omp parallel for
    for (int i = local_start; i < local_end; ++i) {
        // compute update on local data
    }
}

MPI_Finalize();
Characteristics
- Simple: Minimal code changes if serial loops are already present.
- Good starting point: Often used as the first hybrid step from pure MPI.
- Scalability limit: Does not exploit more advanced patterns like tasking or nested parallel regions.
Use cases
- Time-stepping solvers where each step involves straightforward loops with no complex dependencies.
- Codes that already have a stable MPI decomposition and want a modest performance boost on multi-core nodes.
Pattern 4: MPI + OpenMP Task-Based Parallelism
Here, MPI still manages the distributed decomposition, but within each process, OpenMP tasks are used instead of (or in addition to) simple parallel loops.
Structure
- MPI: same as before, each process owns a chunk of the global problem.
- Within each process:
- Work is organized as many relatively small tasks.
- OpenMP tasks are created and dynamically scheduled across threads.
- Dependencies can be expressed to preserve correctness.
Conceptual example:
// MPI sets up process-local data structures with multiple independent sub-problems
MPI_Setup(...);

#pragma omp parallel
{
    #pragma omp single
    {
        for (int k = 0; k < num_local_subproblems; ++k) {
            #pragma omp task
            {
                solve_subproblem(k);
            }
        }
    }
}

MPI_Exchange_Results(...);
Characteristics
- Flexible load balancing inside a process:
- Tasks can be scheduled dynamically so idle threads pick up extra work.
- Handles irregular workloads better than static loop parallelism.
- Requires more careful design to avoid overly fine-grained tasks, where task-creation overhead outweighs the useful work.
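One common way to keep tasks from becoming too fine-grained is a creation cutoff: a task is only spawned when the sub-problem is large enough to amortize the overhead, and small sub-problems run inline. A minimal sketch, assuming a hypothetical cost estimate estimate_work() provided by the application:
#include <omp.h>

/* Hypothetical helpers, assumed to exist in the application. */
extern double estimate_work(int k);      /* rough cost of sub-problem k */
extern void solve_subproblem(int k);

void solve_all(int num_local_subproblems, double cutoff) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < num_local_subproblems; ++k) {
        /* Small sub-problems run inline in the encountering thread;
           only sufficiently large ones become deferred tasks. */
        #pragma omp task if(estimate_work(k) > cutoff)
        solve_subproblem(k);
    }
}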
Typical applications
- Adaptive mesh refinement where some blocks require more work than others.
- Sparse linear algebra where row/column workloads differ widely.
- Multi-physics with independent or loosely coupled components per MPI rank.
Pattern 5: Domain Decomposition Hybrid (MPI Across, Threads Within)
Many simulation codes use geometric or topological domain decomposition. The hybrid pattern is:
- MPI level: Decompose the global physical domain into subdomains.
- Thread level: Further partition or parallelize work within each subdomain.
Structure
- Each MPI process owns a subdomain:
- Regular 1D/2D/3D blocks for structured grids.
- Collections of elements/cells for unstructured meshes.
- With OpenMP:
- Parallelize loops over cells/elements/edges.
- Possibly assign chunks of the subdomain to different threads.
Conceptual flow:
- MPI partition: assign blocks/elements to ranks.
- For each timestep:
- Exchange boundary data between MPI neighbors.
- Thread-level parallel update of interior cells.
- Possibly thread-parallel update of boundary cells.
Variants
- Block + thread: MPI ranks own non-overlapping blocks; threads operate over slices or tiles.
- Coloring + thread: Within a subdomain, use graph coloring or similar schemes to allow conflict-free threaded updates.
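A minimal sketch of the coloring variant, with an illustrative (assumed) coloring layout: elements of the same color share no data, so each per-color loop can be an ordinary parallel for, and the implicit barrier between colors preserves correctness.
#include <omp.h>

/* Illustrative (assumed) coloring layout: color c owns the elements
   element_ids[color_start[c] .. color_start[c+1] - 1]. */
typedef struct {
    int num_colors;
    int *color_start;
    int *element_ids;
} coloring_t;

extern void update_element(int element_id);   /* assumed application kernel */

void colored_update(const coloring_t *c) {
    for (int col = 0; col < c->num_colors; ++col) {
        /* Elements of one color never touch the same nodes/cells, so this
           loop has no write conflicts; the implicit barrier at its end
           separates colors. */
        #pragma omp parallel for
        for (int i = c->color_start[col]; i < c->color_start[col + 1]; ++i)
            update_element(c->element_ids[i]);
    }
}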
When it shines
- Grid-based PDE solvers.
- Finite element/finite volume methods.
- Any simulation with relatively regular decompositions, where the main work consists of nearest-neighbor or local operations.
Pattern 6: “Master MPI + Worker Threads” Within Each Rank
In this pattern, one thread per MPI process handles communication and orchestration, while other threads focus on computation.
Structure
- MPI process starts an OpenMP parallel region.
- Inside, designate:
- A master thread (or a small group) to:
- Perform MPI communication.
- Coordinate tasks.
- Manage global control flow.
- Worker threads to:
- Execute computational kernels.
- Possibly operate on tasks scheduled by the master.
Conceptual outline:
MPI_Init_thread(..., MPI_THREAD_FUNNELED, ...);
#pragma omp parallel
{
if (omp_get_thread_num() == 0) {
// master thread: perform MPI communication
manage_mpi_and_tasks();
} else {
// worker threads: perform computations
do_computation_work();
}
}
MPI_Finalize();Characteristics
- MPI calls are “funneled” through a designated thread, so the MPI library only needs MPI_THREAD_FUNNELED support rather than full thread safety (MPI_THREAD_MULTIPLE).
- Can overlap computation and communication:
- Master thread communicates while workers compute on already-available data.
- Requires careful coordination to avoid idle time and race conditions.
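The following sketch illustrates the overlap under MPI_THREAD_FUNNELED: thread 0 drives a nonblocking halo exchange with two neighbor ranks while the remaining threads update interior cells that do not depend on the halo. The buffer layout, neighbor ranks, and update kernels are assumptions for illustration, not a specific code.
#include <mpi.h>
#include <omp.h>

/* Assumed application kernels: interior cells need no halo data,
   boundary cells do. */
extern void update_interior(int tid, int nthreads);
extern void update_boundary(int tid, int nthreads);

/* send_buf/recv_buf each hold n values per neighbor: [0..n) for the left
   neighbor, [n..2n) for the right one. */
void step(double *send_buf, double *recv_buf, int n, int left, int right) {
    MPI_Request req[4];

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();

        if (tid == 0) {
            /* Funneled thread: the only one that touches MPI. */
            MPI_Irecv(recv_buf,     n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(recv_buf + n, n, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
            MPI_Isend(send_buf,     n, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
            MPI_Isend(send_buf + n, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        } else {
            update_interior(tid, nth);   /* overlaps with the communication */
        }

        #pragma omp barrier              /* halo data has now arrived in recv_buf */
        update_boundary(tid, nth);
    }
}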
Use cases
- Latency-sensitive codes that benefit from communication/computation overlap.
- Codes where MPI threading support is limited (e.g., only MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED).
Pattern 7: MPI + GPU + CPU Threads (Three-Level Hybrid)
On GPU-accelerated systems, a common hybrid pattern adds GPUs to the MPI + CPU threads model.
Structure
- MPI between nodes and often between GPUs:
- Typically one MPI rank per GPU or per subset of GPUs.
- Within each MPI process:
- CPU threads (OpenMP) to:
- Prepare data for GPU.
- Launch GPU kernels.
- Perform CPU-side computations.
- GPU(s) handle the heavy numerical kernels.
Conceptual layout:
- MPI partitions the global problem across ranks.
- Each rank:
- Owns one or more GPUs.
- Uses threads to manage multiple GPU streams or CPU-side work.
- Data movement:
- MPI → host buffers → GPU(s) and back.
Variants
- 1 MPI rank per GPU:
- Clean mapping: each rank talks to one GPU.
- Simpler code, but more MPI processes.
- 1 MPI rank per node, multiple GPUs:
- One MPI process controls several GPUs.
- OpenMP threads can manage separate GPUs or different streams on the same GPU.
Use cases
- Large-scale GPU-accelerated solvers (e.g., molecular dynamics, climate models).
- Applications that split work between CPU and GPU (e.g., GPU does dense math, CPU does control / sparse parts).
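As a concrete illustration of the “1 MPI rank per GPU” variant, here is a minimal sketch that assumes the CUDA runtime API: ranks sharing a node are identified with MPI_Comm_split_type (MPI-3), and each node-local rank binds to one device.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int world_rank, local_rank, num_devices;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Ranks on the same node share a communicator; the node-local rank
       decides which GPU this process drives. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);

    printf("rank %d uses GPU %d of %d on its node\n",
           world_rank, local_rank % num_devices, num_devices);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}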
Pattern 8: MPI + Threaded Libraries
Sometimes the application itself uses MPI only, but links to libraries that provide threaded parallelism internally.
Structure
- Application:
- Written as pure MPI (no explicit OpenMP pragmas in user code).
- Libraries:
- Use OpenMP or another threading model inside functions like dgemm, FFTs, or solver routines.
- The resulting execution:
- MPI processes run on multiple nodes.
- Inside each process, library calls spawn threads and use all cores.
Example pattern:
MPI_Init(...);
// Decompose data with MPI
// Local work uses threaded BLAS/LAPACK
dgemm(...); // internally uses OpenMP
MPI_Finalize();
Characteristics
- Minimal changes to application code.
- Relies on:
- Correctly setting thread counts in libraries (OMP_NUM_THREADS, library-specific environment variables).
- Avoiding over-subscription (too many threads per node due to multiple ranks each spawning many threads).
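A small sketch of one way to catch over-subscription at startup: compare ranks-per-node times threads-per-rank against the CPUs the node reports. It assumes MPI-3 (MPI_Comm_split_type) and a Linux-style sysconf(_SC_NPROCESSORS_ONLN); batch systems and pinning tools may constrain the picture further.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <unistd.h>

/* Call after MPI_Init; warns once per node if ranks x threads exceed CPUs. */
void warn_if_oversubscribed(void) {
    MPI_Comm node_comm;
    int ranks_on_node, node_rank;

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_size(node_comm, &ranks_on_node);
    MPI_Comm_rank(node_comm, &node_rank);

    int threads_per_rank = omp_get_max_threads();   /* honours OMP_NUM_THREADS */
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);      /* logical CPUs on this node */

    if (node_rank == 0 && (long)ranks_on_node * threads_per_rank > cpus)
        fprintf(stderr, "warning: %d ranks x %d threads > %ld CPUs on this node\n",
                ranks_on_node, threads_per_rank, cpus);

    MPI_Comm_free(&node_comm);
}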
Use cases
- Codes that are primarily composed of calls to well-optimized numerical libraries.
- Transitional phase before full explicit hybridization of user code.
Pattern 9: Checkerboard / Nested Decomposition
Some applications use multiple levels of decomposition both at the MPI and thread levels.
Structure
- MPI:
- Decompose into large, coarse-grained subdomains.
- Within each rank:
- Further decompose into tiles, blocks, or patches.
- Use threads to work over these tiles, often with cache-friendly blocking.
Conceptual view:
- MPI rank owns a large subdomain (e.g., a 3D block).
- It divides this into smaller tiles.
- OpenMP parallelizes over tiles or over loops within tiles.
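A minimal sketch of the thread-level tiling inside one rank's 2D block; the tile size and the per-point kernel update_point() are illustrative assumptions.
#include <omp.h>

#define TILE 64   /* illustrative tile edge length */

extern void update_point(int j, int i);   /* assumed per-point kernel */

/* The rank owns a local_ny x local_nx block; threads take whole tiles. */
void update_block(int local_ny, int local_nx) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (int jt = 0; jt < local_ny; jt += TILE) {
        for (int it = 0; it < local_nx; it += TILE) {
            /* One tile per loop iteration; a tile is sized to fit in cache. */
            for (int j = jt; j < jt + TILE && j < local_ny; ++j)
                for (int i = it; i < it + TILE && i < local_nx; ++i)
                    update_point(j, i);
        }
    }
}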
This is sometimes called:
- Nested decomposition (one decomposition applied on top of another).
- Checkerboard decomposition in structured grid contexts.
Benefits
- Improved cache locality and memory bandwidth usage.
- Better control over granularity:
- MPI handles only large chunks.
- Threads handle smaller tiles with less communication overhead.
Use cases
- Large structured-grid simulations (e.g., weather, ocean models).
- Applications where cache blocking is critical for performance.
Pattern 10: Hybrid Pipeline / Staged Processing
In some workflows, the computation naturally breaks into stages (e.g., read → preprocess → simulate → postprocess). A hybrid pattern can assign:
- MPI to distribute large data sets or simulation regions across nodes.
- Threads to process different stages or items in a pipeline fashion.
Structure
- Each MPI rank:
- Owns a subset of the data or simulation instances.
- Organizes work as a pipeline of stages.
- Threads:
- Work on different items at different stages simultaneously.
- Example:
- Thread 0 reads/prepares data.
- Thread 1 does main computation.
- Thread 2 writes results.
Conceptual idea:
- Exploit concurrency between stages (stage parallelism) in addition to data parallelism between MPI ranks.
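One possible realization (not the only one) of stage parallelism within a rank uses OpenMP task dependencies: item i flows read → compute → write, and different items can occupy different stages at the same time. The stage functions below are placeholders.
#include <omp.h>
#include <stdlib.h>

/* Placeholder stage functions, assumed to exist in the application. */
extern void read_and_prepare(int i);
extern void compute_item(int i);
extern void write_results(int i);

void run_pipeline(int num_items) {
    /* One byte per item acts purely as a dependence token. */
    char *prepared = calloc(num_items, 1);
    char *computed = calloc(num_items, 1);

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < num_items; ++i) {
            #pragma omp task depend(out: prepared[i])
            read_and_prepare(i);

            #pragma omp task depend(in: prepared[i]) depend(out: computed[i])
            compute_item(i);

            #pragma omp task depend(in: computed[i])
            write_results(i);
        }
    }   /* all tasks complete at the barrier that ends the parallel region */

    free(prepared);
    free(computed);
}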
Use cases
- Data-intensive workflows with I/O, preprocessing, simulation, and analysis.
- Ensemble simulations where many independent runs are processed through the same pipeline.
Pattern 11: Ensemble or Replicated Simulations (MPI for Many, Threads Within Each)
Some workloads consist of many independent or weakly coupled simulations (ensembles, parameter sweeps, Monte Carlo). A hybrid pattern is:
- MPI:
- Distribute independent simulations across nodes/ranks.
- Threads within each rank:
- Parallelize the simulation itself (e.g., particles, grid cells).
- Or process multiple small instances in parallel on a node.
Structure
- MPI rank-level:
- Each rank is responsible for one or more ensemble members.
- OpenMP:
- Parallelize each member’s computation.
- Alternatively, parallelize over multiple ensemble members held by a single rank.
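A minimal sketch of the second mapping, where threads parallelize over the members a rank holds; run_member() is an assumed placeholder for one independent simulation. In the first mapping, the member loop would instead run serially on each rank and run_member() itself would be OpenMP-parallel.
#include <mpi.h>
#include <omp.h>

extern void run_member(int member_id);   /* assumed: one independent simulation */

void run_ensemble(int num_members) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Members are dealt out round-robin across ranks; dynamic scheduling
       helps when members have uneven cost. */
    #pragma omp parallel for schedule(dynamic)
    for (int m = rank; m < num_members; m += size)
        run_member(m);
}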
Benefits
- High throughput for parameter studies and uncertainty quantification.
- Flexible mapping:
- Use threads to adapt to varying simulation sizes; some ensemble members may be smaller or larger than others.
Choosing a Hybrid Pattern
Selecting a pattern is primarily driven by:
- Application structure:
- Regular grids → domain decomposition hybrid.
- Irregular, dynamic workloads → task-based hybrids.
- Library-heavy codes → MPI + threaded libraries.
- Hardware:
- Multi-socket CPUs with NUMA → 1 rank per socket, threads within.
- GPU nodes → MPI + GPU + CPU threads.
- Software constraints:
- MPI thread support level.
- Existing code structure and maintainability considerations.
In practice, production codes often combine several patterns. For example:
- MPI domain decomposition + OpenMP loop parallelism + threaded BLAS.
- MPI + GPU, plus CPU threads for data preparation and auxiliary tasks.
Understanding these patterns helps you read, design, and reason about hybrid codes, rather than treating MPI, OpenMP, and accelerators as separate, unrelated tools.