
Common hybrid programming patterns

Overview of Hybrid Programming Patterns

Hybrid programming combines multiple parallel models—typically MPI between nodes and threads (OpenMP, CUDA, etc.) within nodes. Rather than using all possible combinations, most production codes rely on a small set of recurring patterns.

This chapter focuses on those common patterns: how they are structured conceptually, what they are good for, and what trade-offs they introduce. Detailed syntax for MPI, OpenMP, or CUDA is covered elsewhere; here we focus on design shapes.

Pattern 1: Pure MPI vs. Hybrid MPI + Threads

Before listing specific patterns, it helps to contrast them with a baseline: pure MPI, in which every core runs its own MPI rank and all parallelism, even within a node, is expressed through message passing.

Typical motivations for moving from pure MPI to hybrid include a smaller per-node memory footprint (fewer rank-private buffers and halo copies), fewer ranks participating in collectives and point-to-point traffic, and the ability to share data directly through node-local memory instead of messaging between ranks on the same node.

Most patterns below are about how to divide work across MPI processes and threads in a sensible way.

Pattern 2: MPI Between Nodes, OpenMP Within Nodes

This is the canonical hybrid pattern for CPU-only systems.

Conceptual structure

Conceptually:

  1. MPI decomposes the global problem into subdomains.
  2. Each process owns one or more subdomains.
  3. Within each process, OpenMP parallelizes loops or tasks on that subdomain.
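
A minimal "who am I" program, shown below, makes the two levels concrete: the MPI rank identifies the process and node, and the OpenMP threads identify the workers inside each rank. The launch configuration (for example, one rank per node with OMP_NUM_THREADS set to the number of cores) is left to the job script; the code is only an illustrative sketch, not tied to any particular application.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char node[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(node, &len);

    #pragma omp parallel
    {
        /* Every thread of every rank reports its place in the hierarchy. */
        printf("rank %d of %d on %s: thread %d of %d\n",
               rank, size, node, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}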

Common variations

Where it’s useful

Pattern 3: MPI + OpenMP Loop-Level Parallelism

In this pattern, MPI handles the high-level decomposition, and OpenMP is used only for simple loop-level parallelism inside each process.

Structure

Pseudo-structure:

MPI_Init(...);
determine_subdomain(...);
for (int time_step = 0; time_step < T; ++time_step) {
    exchange_halos_with_MPI(...);
    #pragma omp parallel for
    for (int i = local_start; i < local_end; ++i) {
        // compute update on local data
    }
}
MPI_Finalize();
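
For concreteness, here is a compilable version of the same shape for a 1-D diffusion-style update. The local size, the neighbor logic, and the update formula are illustrative stand-ins rather than a recommendation.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 1024     /* interior points owned by each rank */
#define STEPS   100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* u[0] and u[LOCAL_N + 1] are halo cells filled from the neighbors. */
    double *u     = calloc(LOCAL_N + 2, sizeof(double));
    double *u_new = calloc(LOCAL_N + 2, sizeof(double));
    u[LOCAL_N / 2] = 1.0;                        /* arbitrary initial condition */

    for (int step = 0; step < STEPS; ++step) {
        /* Halo exchange with MPI (serial phase). */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Thread-parallel update of the local interior. */
        #pragma omp parallel for
        for (int i = 1; i <= LOCAL_N; ++i)
            u_new[i] = u[i] + 0.1 * (u[i - 1] - 2.0 * u[i] + u[i + 1]);

        double *tmp = u; u = u_new; u_new = tmp;   /* swap time levels */
    }

    if (rank == 0) printf("u[1] after %d steps: %f\n", STEPS, u[1]);

    free(u); free(u_new);
    MPI_Finalize();
    return 0;
}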

Characteristics

Use cases

Pattern 4: MPI + OpenMP Task-Based Parallelism

Here, MPI still manages the distributed decomposition, but within each process, OpenMP tasks are used instead of (or in addition to) simple parallel loops.

Structure

Conceptual example:

// MPI-level setup produces process-local data with multiple independent sub-problems
setup_local_subproblems_with_MPI(...);
#pragma omp parallel
{
  #pragma omp single
  {
    for (int k = 0; k < num_local_subproblems; ++k) {
      #pragma omp task
      {
        solve_subproblem(k);
      }
    }
  }
}
// Implicit barrier at the end of the parallel region: all tasks have completed
// before results are exchanged between ranks.
exchange_results_with_MPI(...);
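
A compilable sketch of the same idea follows. The number of sub-problems, the deliberately uneven toy workload, and the final MPI_Reduce are illustrative choices; the point is that the implicit barrier at the end of the parallel region guarantees every task has finished before MPI is used to combine results.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NUM_LOCAL_SUBPROBLEMS 8

static double results[NUM_LOCAL_SUBPROBLEMS];

/* Toy workload: sub-problems get deliberately uneven amounts of work. */
static void solve_subproblem(int k, int rank) {
    double acc = 0.0;
    for (int i = 0; i < 100000 * (k + 1); ++i)
        acc += 1e-6;
    results[k] = acc + rank;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp single
        for (int k = 0; k < NUM_LOCAL_SUBPROBLEMS; ++k) {
            #pragma omp task firstprivate(k)
            solve_subproblem(k, rank);
        }
        /* Tasks are guaranteed to be complete at the barriers that end the
           single construct and the parallel region. */
    }

    double local_total = 0.0, global_total = 0.0;
    for (int k = 0; k < NUM_LOCAL_SUBPROBLEMS; ++k)
        local_total += results[k];
    MPI_Reduce(&local_total, &global_total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global total = %f\n", global_total);

    MPI_Finalize();
    return 0;
}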

Characteristics

Typical applications

Pattern 5: Domain Decomposition Hybrid (MPI Across, Threads Within)

Many simulation codes use geometric or topological domain decomposition. The hybrid pattern is:

  1. MPI level: Decompose the global physical domain into subdomains.
  2. Thread level: Further partition or parallelize work within each subdomain.

Structure

Conceptual flow:

  1. MPI partition: assign blocks/elements to ranks.
  2. For each timestep:
    • Exchange boundary data between MPI neighbors.
    • Thread-level parallel update of interior cells.
    • Possibly thread-parallel update of boundary cells.
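
A common refinement of this flow is to overlap the boundary exchange with the interior update using nonblocking MPI. The fragment below sketches one overlapped timestep on a 1-D subdomain with the usual halo convention (u[0] and u[local_n + 1] are halo cells, and left/right may be MPI_PROC_NULL at the domain edges); the update formula and names are illustrative.

#include <mpi.h>

/* One overlapped timestep: post halo messages, update interior cells that do
   not depend on halos while the messages are in flight, then finish the
   boundary cells once the halos have arrived. */
static void timestep_overlapped(double *u, double *u_new,
                                int local_n, int left, int right) {
    MPI_Request reqs[4];
    MPI_Irecv(&u[0],           1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&u[local_n + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&u[1],           1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&u[local_n],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    /* Interior cells 2 .. local_n-1 never read the halos. */
    #pragma omp parallel for
    for (int i = 2; i <= local_n - 1; ++i)
        u_new[i] = u[i] + 0.1 * (u[i - 1] - 2.0 * u[i] + u[i + 1]);

    /* Wait for the halos, then finish the two boundary cells. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    u_new[1]       = u[1]       + 0.1 * (u[0]           - 2.0 * u[1]       + u[2]);
    u_new[local_n] = u[local_n] + 0.1 * (u[local_n - 1] - 2.0 * u[local_n] + u[local_n + 1]);
}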

Variants

When it shines

Pattern 6: “Master MPI + Worker Threads” Within Each Rank

In this pattern, one thread per MPI process handles communication and orchestration, while other threads focus on computation.

Structure

Conceptual outline:

MPI_Init_thread(..., MPI_THREAD_FUNNELED, ...);  // only the main thread will make MPI calls
#pragma omp parallel
{
  if (omp_get_thread_num() == 0) {
    // master thread (the one that initialized MPI): communication and orchestration
    manage_mpi_and_tasks();
  } else {
    // worker threads: perform computations
    do_computation_work();
  }
}
MPI_Finalize();
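
Below is a compilable sketch of this split with a toy workload: thread 0 performs a ring exchange while the remaining threads reduce the local array. The partitioning is done by hand because OpenMP worksharing loops cannot be placed in the worker branch (not all threads reach it); all names and sizes are illustrative.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 1000000

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) MPI_Abort(MPI_COMM_WORLD, 1);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    double *local = malloc(LOCAL_N * sizeof(double));
    for (int i = 0; i < LOCAL_N; ++i) local[i] = rank + 1.0;

    double send_val = local[LOCAL_N - 1], recv_val = 0.0, local_sum = 0.0;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        if (tid == 0) {
            /* The main thread is the only one allowed to call MPI under FUNNELED. */
            MPI_Sendrecv(&send_val, 1, MPI_DOUBLE, right, 0,
                         &recv_val, 1, MPI_DOUBLE, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* Worker threads split the local array among themselves by hand. */
            int workers = nth - 1, wid = tid - 1;
            long chunk = LOCAL_N / workers;
            long start = wid * chunk;
            long end   = (wid == workers - 1) ? LOCAL_N : start + chunk;
            double s = 0.0;
            for (long i = start; i < end; ++i) s += local[i];
            #pragma omp atomic
            local_sum += s;
        }
        /* Implicit barrier: communication and computation are both done past here. */
    }

    if (rank == 0)
        printf("received %.1f from rank %d, local sum %.1f\n", recv_val, left, local_sum);

    free(local);
    MPI_Finalize();
    return 0;
}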

Characteristics

Use cases

Pattern 7: MPI + GPU + CPU Threads (Three-Level Hybrid)

On GPU-accelerated systems, a common hybrid pattern adds GPUs to the MPI + CPU threads model.

Structure

Conceptual layout:

  1. MPI partitions the global problem across ranks.
  2. Each rank:
    • Owns one or more GPUs.
    • Uses threads to manage multiple GPU streams or CPU-side work.
  3. Data movement:
    • MPI → host buffers → GPU(s) and back.
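
One frequently used building block is mapping the ranks that share a node onto that node's GPUs. The sketch below does this with MPI_Comm_split_type and the CUDA runtime C API; the device-side work itself is elided, and the round-robin mapping is just one reasonable choice. It must be compiled and linked against both MPI and the CUDA runtime.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks on the same node get consecutive local ranks via a shared-memory split. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    /* Bind each rank to one GPU; several ranks may share a GPU if the node has
       more ranks than devices. */
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus > 0) {
        cudaSetDevice(local_rank % num_gpus);
        printf("rank %d (node-local %d) -> GPU %d\n",
               rank, local_rank, local_rank % num_gpus);
    }

    /* ... allocate device buffers, copy host data in, launch kernels (possibly
       from several CPU threads, each driving its own stream), copy results
       back, then exchange them with MPI ... */

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}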

Variants

Use cases

Pattern 8: MPI + Threaded Libraries

Sometimes the application itself uses MPI only, but links to libraries that provide threaded parallelism internally.

Structure

Example pattern:

MPI_Init(...);
// Decompose data with MPI
// Local work calls a threaded BLAS/LAPACK
cblas_dgemm(...);  // multi-threaded inside the library (OpenMP or pthreads, depending on the BLAS)
MPI_Finalize();
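
Assuming a threaded CBLAS implementation such as OpenBLAS or MKL is linked (its thread count is usually controlled by OMP_NUM_THREADS or a library-specific variable), a compilable version of this pattern might look like the sketch below; the matrix sizes and contents are placeholders.

#include <mpi.h>
#include <cblas.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int n = 512;                              /* each rank works on its own local blocks */
    double *A = malloc(n * n * sizeof(double));
    double *B = malloc(n * n * sizeof(double));
    double *C = calloc(n * n, sizeof(double));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; }

    /* The application only calls the library; the parallelism inside dgemm
       comes from the threaded BLAS implementation. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}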

Characteristics

Use cases

Pattern 9: Checkerboard / Nested Decomposition

Some applications use multiple levels of decomposition, at both the MPI and the thread level.

Structure

Conceptual view:

  1. MPI rank owns a large subdomain (e.g., a 3D block).
  2. It divides this into smaller tiles.
  3. OpenMP parallelizes over tiles or over loops within tiles.
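
The thread-level part of this pattern might look like the sketch below, applied to one rank's block stored as a flattened 2-D array; the tile size, the dynamic schedule, and the update expression are placeholder choices.

#define TILE 64

/* Threads take whole tiles of the rank's nx-by-ny block; collapse(2) exposes
   all tiles to the scheduler, and schedule(dynamic) balances uneven tiles. */
static void update_block(double *u_new, const double *u, int nx, int ny) {
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int ti = 0; ti < nx; ti += TILE)
        for (int tj = 0; tj < ny; tj += TILE) {
            int ti_end = ti + TILE < nx ? ti + TILE : nx;
            int tj_end = tj + TILE < ny ? tj + TILE : ny;
            for (int i = ti; i < ti_end; ++i)
                for (int j = tj; j < tj_end; ++j)
                    u_new[i * ny + j] = 0.5 * u[i * ny + j];   /* stand-in update */
        }
}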

This is sometimes called nested or hierarchical decomposition, or simply tiling/blocking within MPI subdomains.

Benefits

Use cases

Pattern 10: Hybrid Pipeline / Staged Processing

In some workflows, the computation naturally breaks into stages (e.g., read → preprocess → simulate → postprocess). A hybrid pattern can assign different stages to different groups of MPI ranks, with threads parallelizing the work inside each stage.

Structure

Conceptual idea: data flows from one stage's group of ranks to the next as it is produced, so that the stages overlap in time rather than running strictly one after another.
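
A minimal sketch of the rank grouping, assuming four stages and a simple round-robin assignment of ranks to stages (both purely illustrative choices), can be built with MPI_Comm_split:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { READ, PREPROCESS, SIMULATE, POSTPROCESS, NUM_STAGES };
    int stage = rank % NUM_STAGES;              /* illustrative stage assignment */

    /* Ranks of the same stage share a communicator for stage-internal work. */
    MPI_Comm stage_comm;
    MPI_Comm_split(MPI_COMM_WORLD, stage, rank, &stage_comm);

    printf("rank %d of %d works in stage %d\n", rank, size, stage);

    /* ... threaded work inside each stage, plus point-to-point transfers of
       data batches from one stage's ranks to the next ... */

    MPI_Comm_free(&stage_comm);
    MPI_Finalize();
    return 0;
}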

Use cases

Pattern 11: Ensemble or Replicated Simulations (MPI for Many, Threads Within Each)

Some workloads consist of many independent or weakly coupled simulations (ensembles, parameter sweeps, Monte Carlo). A natural hybrid pattern is to give each ensemble member its own MPI rank (or small group of ranks) and to parallelize each individual simulation with threads.

Structure
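
A compilable sketch with one rank per ensemble member is shown below; run_member is a stand-in for the real, internally threaded simulation, and the final reduction is just one way of collecting an ensemble statistic.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Stand-in for one ensemble member's simulation; a real code would run a full
   OpenMP-parallel model here. */
static double run_member(int member_id) {
    double acc = 0.0;
    #pragma omp parallel for reduction(+:acc)
    for (int i = 0; i < 1000000; ++i)
        acc += (member_id + 1) * 1e-6;
    return acc;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double result = run_member(rank);           /* member index = MPI rank */

    double sum = 0.0;                           /* gather a simple ensemble statistic */
    MPI_Reduce(&result, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("ensemble mean = %f\n", sum / size);

    MPI_Finalize();
    return 0;
}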

Benefits

Choosing a Hybrid Pattern

Selecting a pattern is primarily driven by the structure of the problem (regular grid, many independent sub-problems, a pipeline of stages), the hardware (CPU-only versus GPU-accelerated nodes, memory per node), the communication pattern, and the libraries the code already relies on.

In practice, production codes often combine several patterns. For example, a domain-decomposed solver may use loop-level OpenMP in its stencil kernels, call threaded linear-algebra libraries for local dense work, and add a GPU offload path on accelerated systems.

Understanding these patterns helps you read, design, and reason about hybrid codes, rather than treating MPI, OpenMP, and accelerators as separate, unrelated tools.
