
Combining MPI and OpenMP

Why Combine MPI and OpenMP?

Hybrid programming means using MPI between nodes and OpenMP within nodes (or sockets). The motivation was introduced in the parent chapter; here we focus on how the combination actually works in practice and what is specific to running both together.

Typical goals:

  - Reduce the number of MPI ranks per node, and with them the per-rank memory overhead (halo copies, communication buffers).
  - Exploit shared memory within a node instead of passing messages between cores that already share RAM.
  - Keep the total number of ranks, and thus the number of MPI communication partners, manageable as the job scales out.

A common mental model: MPI distributes the problem across nodes (or sockets), and OpenMP spreads each rank's work across the cores available to it.

Typical Hybrid Execution Models

One MPI Process per Node

This model is often used for memory-bound codes where data is easily shared by all threads on a node.

One MPI Process per Socket (or NUMA Domain)

This is a very common and robust baseline configuration.

One MPI Process per Core (or Hardware Thread)

This is effectively a flat-MPI configuration with one thread per rank; hybrid designs usually move away from this extreme unless the code is hard to thread.

Choosing a Model

Key factors:

  - Memory footprint per rank (replicated data, halo regions, MPI buffers).
  - NUMA layout: threads should work mostly on memory local to their socket.
  - How well the compute kernels scale with OpenMP threads.
  - How much time is spent in MPI communication relative to computation.

You typically experiment with different combinations of ranks per node and threads per rank (for example 1×16, 2×8, and 4×4 on a 16-core node) and measure which configuration performs best for your workload.

Basic Structure of a Hybrid MPI + OpenMP Program

A typical C / C++ skeleton (Fortran is analogous):

#include <mpi.h>
#include <omp.h>
int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    // Always check the threading level actually provided
    if (provided < MPI_THREAD_FUNNELED) {
        // handle error or fall back to another mode
    }
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    // Optionally set number of threads (else use OMP_NUM_THREADS)
    // omp_set_num_threads(4);
    // Serial + MPI region
    // ...
    // Parallel region with OpenMP
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        // Example: only one thread calls MPI (allowed under FUNNELED
        // when it is the thread that initialized MPI)
        #pragma omp master
        {
            // MPI call from the master thread
            // MPI_Send / MPI_Recv / MPI_Bcast / ...
        }
        // Note: omp master has no implied barrier; synchronize before
        // other threads consume data produced by the MPI calls
        #pragma omp barrier
        // OpenMP-only computation here
        // ...
    }
    MPI_Finalize();
    return 0;
}

Key points that are specific to hybrid programming:

  - MPI is initialized with MPI_Init_thread rather than MPI_Init, and the provided thread-support level is checked explicitly.
  - Under MPI_THREAD_FUNNELED, MPI calls inside parallel regions are confined to the master thread.
  - The number of OpenMP threads is controlled per rank, via OMP_NUM_THREADS or omp_set_num_threads.
  - The same executable runs as several MPI processes, each of which spawns its own team of OpenMP threads.
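
For reference, a compile-and-run sketch might look like the following, assuming a GCC-based MPI compiler wrapper (mpicc) and a generic mpirun launcher; the exact wrapper name, flags, and launcher vary between MPI installations and clusters.

mpicc -fopenmp -O2 hybrid.c -o hybrid    # enable both MPI and OpenMP
export OMP_NUM_THREADS=8                 # threads per MPI rank
mpirun -np 4 ./hybrid                    # 4 ranks x 8 threads = 32 cores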

MPI Threading Levels and Their Impact

When combining MPI and OpenMP, you must tell MPI what level of thread support you need. This is done via:

int MPI_Init_thread(int *argc, char ***argv, int required, int *provided);

The required and provided values are integers representing the thread support level:

  - MPI_THREAD_SINGLE: only one thread will exist in the process.
  - MPI_THREAD_FUNNELED: the process may be multithreaded, but only the thread that initialized MPI makes MPI calls.
  - MPI_THREAD_SERIALIZED: any thread may make MPI calls, but never more than one at a time.
  - MPI_THREAD_MULTIPLE: multiple threads may call MPI concurrently.

Hybrid codes often choose the lowest level that satisfies their needs, to minimize MPI overhead. Many applications work well with MPI_THREAD_FUNNELED: all communication is funneled through the master thread, while the remaining threads focus purely on computation.
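
As a minimal sketch of this negotiation, the following program requests MPI_THREAD_FUNNELED and aborts cleanly if the library provides less (the thread-level constants are defined by the MPI standard to be monotonically ordered, so the comparison is valid):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        // The library cannot guarantee the level we need: report it and
        // shut down rather than risk undefined behavior later
        fprintf(stderr, "Insufficient MPI thread support: %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // ... hybrid work as in the skeleton above ...

    MPI_Finalize();
    return 0;
}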

Basic Collaboration Pattern: MPI Outside, OpenMP Inside

A very common hybrid pattern:

  1. MPI decides global data decomposition (which process works on which subset).
  2. Each process works on its local subset using OpenMP to spread work across threads.
  3. MPI communicates boundary data, global reductions, etc.

Conceptually: MPI owns the coarse-grained decomposition and the communication between processes, while OpenMP owns the fine-grained, loop-level parallelism inside each process's subdomain.

Simple example structure (pseudocode):

MPI_Init_thread(...);
// Decide the local domain based on rank and world_size
for (int time_step = 0; time_step < T; ++time_step) {
    // COMPUTE: thread-parallel region
    #pragma omp parallel for
    for (int i = local_start; i < local_end; ++i) {
        // update local data[i]
    }
    // COMMUNICATE: halo exchange with neighboring ranks
    MPI_Sendrecv(...);
    // or MPI_Irecv / MPI_Isend + MPI_Wait...
}
MPI_Finalize();

MPI is mostly used in serial sections; OpenMP is used inside compute-heavy regions. This pattern aligns well with MPI_THREAD_FUNNELED.
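
To make the pattern concrete, here is a hedged sketch of a 1D halo exchange combined with an OpenMP-parallel stencil update. The array size N_LOCAL, the number of steps, and the averaging stencil are illustrative assumptions, not taken from a specific application.

#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define N_LOCAL 1000   /* interior points per rank (illustrative) */
#define N_STEPS 100

int main(int argc, char **argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Local array with one ghost cell on each side: indices 0 and N_LOCAL+1 */
    double *u     = calloc(N_LOCAL + 2, sizeof(double));
    double *u_new = calloc(N_LOCAL + 2, sizeof(double));

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < N_STEPS; ++step) {
        /* COMMUNICATE: exchange boundary values with neighbors
           (MPI_PROC_NULL turns the edge exchanges into no-ops) */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[N_LOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* COMPUTE: OpenMP-parallel stencil update over the interior */
        #pragma omp parallel for
        for (int i = 1; i <= N_LOCAL; ++i) {
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);
        }

        /* Swap buffers for the next step */
        double *tmp = u; u = u_new; u_new = tmp;
    }

    free(u);
    free(u_new);
    MPI_Finalize();
    return 0;
}

All MPI calls happen outside the OpenMP region, so MPI_THREAD_FUNNELED is sufficient here.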

Managing MPI Calls in OpenMP Regions

When OpenMP parallel regions and MPI calls must coexist, you need to organize who talks to MPI.

Master-Only Communication (FUNNELED)

Used with MPI_THREAD_FUNNELED:

#pragma omp parallel
{
    // Master thread handles MPI
    #pragma omp master
    {
        MPI_Recv(...);
        MPI_Send(...);
    }
    // Explicit barrier is required here: omp master has no implied barrier
    #pragma omp barrier
    // All threads use the newly received data
    #pragma omp for
    for (int i = 0; i < N; ++i) {
        // compute with data
    }
}

Serialized MPI Calls (SERIALIZED)

Used with MPI_THREAD_SERIALIZED:

#pragma omp parallel
{
    // Some threads might need to communicate
    if (need_to_communicate) {
        // The named critical section guarantees that only one thread at a
        // time is inside MPI, which is exactly what SERIALIZED requires
        #pragma omp critical(mpi_comm)
        {
            MPI_Send(...);
            MPI_Recv(...);
        }
    }
    // Other work, possibly in parallel
}

Fully Multithreaded Communication (MULTIPLE)

Used with MPI_THREAD_MULTIPLE:

MPI_Init_thread(..., MPI_THREAD_MULTIPLE, &provided);
#pragma omp parallel
{
    // In principle, any thread can call MPI without special restrictions
    if (condition) {
        MPI_Request reqs[2];
        MPI_Isend(..., &reqs[0]);
        MPI_Irecv(..., &reqs[1]);
        // Nonblocking requests must still be completed by some thread
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }
}

Process and Thread Affinity in Hybrid Runs

Hybrid performance is very sensitive to where processes and threads run.

Key aspects:

  - Each MPI rank should be pinned to a well-defined set of cores, for example one socket or NUMA domain.
  - OpenMP threads should be bound to cores within their rank's set, so they neither migrate nor collide with the threads of other ranks.
  - The batch scheduler, the MPI launcher, and the OpenMP runtime each have their own affinity settings, and they must agree.

Typical environment configuration:

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores

Launch example (SLURM-like, just conceptual):

srun --ntasks-per-node=2 --cpus-per-task=8 ./my_hybrid_app

This coordination between scheduler, MPI, and OpenMP is central to a successful hybrid configuration.
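
A quick way to verify the resulting placement is to have every thread report where it runs. The following is a minimal sketch assuming a Linux system, where sched_getcpu() is available; on other platforms a different query would be needed.

#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    // Each thread prints the core it is currently running on
    #pragma omp parallel
    {
        printf("host %s | rank %d | thread %d of %d | core %d\n",
               host, rank, omp_get_thread_num(), omp_get_num_threads(),
               sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}

If the output shows several threads sharing a core, or threads of different ranks landing on the same cores, the binding settings above need to be revisited.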

Memory Usage Considerations in Hybrid Codes

Combining MPI and OpenMP changes memory behavior:

  - Fewer ranks per node means fewer copies of replicated data, halo regions, and MPI-internal buffers.
  - Threads within a rank share its data structures, so large arrays exist once per rank rather than once per core.
  - On NUMA nodes, data placement matters: with a first-touch policy, the thread that first writes a page determines which memory domain it lands in.

Hybrid designs often arise from memory pressure: MPI-only runs may not fit in memory when the number of ranks is large.
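
One practical consequence is that large shared arrays are often initialized in parallel, so that (under a first-touch NUMA policy, as on typical Linux systems) each thread's portion of the array ends up in memory close to the core that will later work on it. A minimal sketch:

#include <stdlib.h>

int main(void) {
    const size_t n = 10 * 1000 * 1000;
    double *a = malloc(n * sizeof(double));

    // Parallel first touch: each thread writes the pages it will later use,
    // so those pages are allocated in that thread's local NUMA domain
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i) {
        a[i] = 0.0;
    }

    // Later compute loops should use the same static distribution so each
    // thread keeps working on the pages it touched first
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i) {
        a[i] = 2.0 * a[i] + 1.0;
    }

    free(a);
    return 0;
}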

Algorithm and Decomposition Choices in a Hybrid Setting

Combining MPI and OpenMP can change how you decompose the problem:

  - MPI subdomains can be larger and fewer (one per node or socket instead of one per core), which reduces the total amount of halo data exchanged.
  - Within each subdomain, OpenMP can parallelize over loops, blocks, or tasks without any further explicit data decomposition.
  - Load imbalance inside a subdomain can often be absorbed by OpenMP scheduling (e.g., dynamic schedules) rather than by repartitioning MPI domains.

Common Pitfalls Specific to Combining MPI and OpenMP

Oversubscribing Cores

If the number of ranks per node multiplied by the threads per rank exceeds the number of available cores, threads compete for cores and performance collapses.

Inconsistent Threading Configuration

OMP_NUM_THREADS, the batch settings (e.g., --cpus-per-task), and any omp_set_num_threads calls must agree; otherwise ranks may silently run with an unintended thread count.

Incorrect MPI Thread Level Assumptions

Calling MPI from multiple threads when only MPI_THREAD_FUNNELED was requested (or provided) is undefined behavior and may fail only intermittently or at scale.

Hidden Global Synchronizations

Implicit barriers at the end of OpenMP parallel and worksharing regions combine with MPI collectives, and together they can serialize far more of the run than expected.

Debugging Complexity

Failures can come from data races, from MPI deadlocks, or from interactions between the two runtimes, which makes problems harder to reproduce and isolate than in a pure MPI or pure OpenMP code.

Incremental Path to Hybridization

For existing MPI codes:

  1. Identify compute-heavy loops inside each MPI process.
  2. Add OpenMP pragmas to those loops, keeping MPI calls outside of parallel regions.
  3. Start with MPI_THREAD_FUNNELED and a small number of threads.
  4. Set a simple rank layout (e.g., 1 rank per socket) and appropriate OMP_NUM_THREADS.
  5. Measure performance and memory usage; adjust rank/thread configuration.
  6. Only if needed, experiment with more advanced patterns (e.g., overlapping communication and computation, MPI_THREAD_MULTIPLE, or more complicated OpenMP constructs).
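
Step 2 can often be as small as adding a single pragma. The sketch below shows one common case: an OpenMP reduction inside each rank followed by an MPI reduction across ranks; the local array size and the summed values are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Illustrative local workload: in a real code this would come from
    // the rank's part of the domain decomposition
    const int n_local = 1000000;
    double local_sum = 0.0;

    // A single pragma makes the per-rank summation thread-parallel,
    // while the MPI calls stay outside the parallel region
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < n_local; ++i) {
        local_sum += (double)(rank + 1);   // stand-in for data[i]
    }

    // Combine the per-rank partial sums across all ranks
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("global sum = %f\n", global_sum);
    }

    MPI_Finalize();
    return 0;
}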

For existing OpenMP codes:

  1. Add MPI-based domain decomposition at the outer level.
  2. Keep most OpenMP parallelism inside the per-domain computations.
  3. Start with small numbers of ranks and nodes, then scale out.

Summary of Key Design Decisions in Hybrid MPI + OpenMP

When combining MPI and OpenMP, you must explicitly decide:

  - which MPI thread support level to request, and what to do if it is not provided,
  - how many ranks per node and how many threads per rank to run,
  - which threads are allowed to make MPI calls, and how that rule is enforced,
  - how ranks and threads are pinned to sockets and cores,
  - how data is decomposed between ranks and, within each rank, among threads.

These decisions differentiate a simple “MPI plus threads” program from a well-designed hybrid HPC application.
