
Combining MPI and OpenMP

Why Combine MPI and OpenMP

Hybrid programming uses MPI between nodes and OpenMP within nodes. MPI handles distributed memory across the cluster. OpenMP handles shared memory inside a single node. Combining them allows you to match the hardware hierarchy of modern clusters, which have many nodes and many cores per node, sometimes with hardware threads.

In a pure MPI code every MPI process has its own memory space, even if several processes run on the same node. In a pure OpenMP code all threads share a single memory space, and the program usually runs on a single node. Hybrid codes try to reduce MPI overhead and memory duplication on each node, while still being able to scale across many nodes.

Basic Hybrid Execution Model

In a typical hybrid program you start several MPI processes, usually one or a few per node. Inside each MPI process you create OpenMP threads. Each MPI process becomes a group of cooperating threads that share memory on that node.

For example, you might run 4 nodes, 2 MPI processes per node, and 16 OpenMP threads per process. The total number of cores in use is then
$$
N_{\text{cores}} = N_{\text{nodes}} \times N_{\text{MPI per node}} \times N_{\text{threads per MPI}}.
$$
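
With the example configuration above, this gives

$$
N_{\text{cores}} = 4 \times 2 \times 16 = 128.
$$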

In a hybrid code, each MPI process sees memory as shared only among its own OpenMP threads. Memory is not shared across MPI processes.

The MPI layer is responsible for communication between nodes and between MPI processes on the same node. The OpenMP layer is responsible for exploiting parallel loops and sections inside each process.

Typical Hybrid Structure of a Program

A common structural pattern is:

  1. Initialize MPI.
  2. Query the MPI rank and communicator size.
  3. Initialize OpenMP settings for threads.
  4. Run a main time stepping or iteration loop where:
    1. Data is decomposed across MPI ranks.
    2. Within each rank, OpenMP parallel regions operate on the local subdomain.
    3. MPI communication steps exchange halo or boundary data between ranks.
  5. Finalize MPI.

In code this often appears as MPI calls around, or interleaved with, OpenMP parallel and work sharing constructs. The OpenMP regions are inside the MPI program, not the other way around.

A minimal hybrid skeleton can look like:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        // Hybrid work: MPI rank 'rank', OpenMP thread 'tid' of 'nthreads'
        printf("rank %d of %d, thread %d of %d\n", rank, size, tid, nthreads);
    }
    MPI_Finalize();
    return 0;
}

Here MPI creates processes. Each process then spawns OpenMP threads for parallel regions.

Domain Decomposition for Hybrid Codes

The key design step in a hybrid program is how you decompose the problem into subdomains for MPI and then into chunks for OpenMP.

At the MPI level you usually use a coarse domain decomposition. For a grid based problem you might decompose the global grid into large blocks, one per MPI rank or per group of ranks. MPI processes exchange boundary data between neighboring blocks.

Inside each MPI process the local block is further subdivided into work units that OpenMP threads can process. Often you keep the data as a single local array per MPI rank, then use OpenMP for constructs over the local index ranges.

For example, consider a 2D array with indices i = 0 .. NI-1 and j = 0 .. NJ-1. At the MPI level you could split the i direction into size strips, one per rank, so that rank r owns rows i_start[r] through i_end[r]. Inside each rank, an OpenMP loop might parallelize over j, or over both i and j, within that local subdomain.

The important point is that MPI decides which piece of the global data each rank owns. OpenMP decides how that local piece is shared among threads.
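
A minimal sketch of this two level split, assuming the rank and size variables from the earlier skeleton and using illustrative names such as i_start, i_end, and local_rows (not part of any library):

// Illustrative 1D strip decomposition of the i direction over MPI ranks,
// with OpenMP sharing the local strip among the threads of each rank.
// NI, NJ, rank, and size are assumed to be defined as in the text.
int rows_per_rank = NI / size;
int remainder     = NI % size;
int i_start = rank * rows_per_rank + (rank < remainder ? rank : remainder);
int i_end   = i_start + rows_per_rank + (rank < remainder ? 1 : 0); // exclusive
int local_rows = i_end - i_start;

// Each rank allocates only its own strip (halo rows omitted for brevity).
double *A = malloc((size_t)local_rows * NJ * sizeof(double));

// OpenMP threads divide the local strip among themselves.
#pragma omp parallel for collapse(2)
for (int i = 0; i < local_rows; i++) {
    for (int j = 0; j < NJ; j++) {
        A[i * NJ + j] = 0.0; // compute on the local subdomain
    }
}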

Managing MPI and OpenMP Initialization

MPI and OpenMP need slightly different initialization. OpenMP is normally enabled and configured by environment variables and compiler flags. MPI requires explicit calls to MPI_Init and MPI_Finalize. When you use them together, two additional concerns appear: the MPI thread support level and the order of initialization.

Hybrid programs should request an appropriate MPI thread support level with MPI_Init_thread rather than MPI_Init. This call lets MPI know whether multiple threads may call MPI functions.

A typical pattern is:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    // Only MPI_THREAD_SINGLE is below MPI_THREAD_FUNNELED, so this test
    // catches the case where the library is effectively single threaded.
    if (provided < MPI_THREAD_FUNNELED && omp_get_max_threads() > 1) {
        fprintf(stderr, "MPI does not support multithreaded use\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    // OpenMP threads can be configured here, e.g. with omp_set_num_threads
    #pragma omp parallel
    {
        // Parallel work
    }
    MPI_Finalize();
    return 0;
}

The required thread level depends on your design. Common levels are:

MPI_THREAD_SINGLE, the process runs only one thread, so no OpenMP threads call MPI.

MPI_THREAD_FUNNELED, the process may be multithreaded, but only the main thread makes MPI calls.

MPI_THREAD_SERIALIZED, any thread may make MPI calls, but never more than one at a time.

MPI_THREAD_MULTIPLE, multiple threads may make MPI calls concurrently.

Most hybrid MPI and OpenMP codes aim for MPI_THREAD_FUNNELED because it is easier to program and is usually better supported and faster than MPI_THREAD_MULTIPLE.

If you plan to call MPI from multiple OpenMP threads, you must request MPI_THREAD_MULTIPLE and check if it is provided. Calling MPI from multiple threads without sufficient support can lead to incorrect results or crashes.
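
A minimal sketch of that request and check, with aborting shown as just one possible response:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) {
    // The library cannot guarantee concurrent MPI calls from threads:
    // either fall back to a funneled design or stop here.
    fprintf(stderr, "MPI_THREAD_MULTIPLE not available, provided level %d\n", provided);
    MPI_Abort(MPI_COMM_WORLD, 1);
}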

Common Hybrid Patterns

There are several common patterns for how MPI and OpenMP work together. The choice affects performance and code complexity.

A frequently used pattern is: "MPI outside, OpenMP inside loops." Your main control structure and communication follow the MPI decomposition, and you insert OpenMP parallel regions around compute intensive loops. Only the master thread of each MPI process performs MPI communication.

In code, this looks like:

// Outside: MPI rank owns a chunk of array A
for (int step = 0; step < nsteps; step++) {
    // Compute on local data in parallel
    #pragma omp parallel for
    for (int i = local_start; i < local_end; i++) {
        // Update A[i] using local data
    }
    // Single thread per rank handles communication
    // For example, halo exchange between ranks
    MPI_Sendrecv(...);
}

Another pattern uses OpenMP to overlap communication and computation by having one thread perform MPI calls while others compute. This still uses MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED, but requires careful use of OpenMP constructs like single, master, and barrier.

For example:

#pragma omp parallel
{
    // All threads compute interior points
    #pragma omp for nowait
    for (int i = interior_start; i < interior_end; i++) {
        // compute
    }
    // One thread performs MPI halo communication
    #pragma omp single
    {
        MPI_Sendrecv(...);
    }
    // After halo exchange, compute boundary points
    #pragma omp for
    for (int i = boundary_start; i < boundary_end; i++) {
        // compute
    }
}

A more advanced pattern allows multiple threads to call MPI for separate messages, but this needs MPI_THREAD_MULTIPLE and careful tuning, so it is often avoided for beginners.
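
For completeness, here is a sketch of that idea, assuming exactly two ranks, MPI_THREAD_MULTIPLE provided, and the rank variable from the earlier skeleton. Each thread exchanges a single value with the same thread index on the partner rank, using the thread id as the message tag:

// Each OpenMP thread posts its own nonblocking exchange.
int partner = (rank == 0) ? 1 : 0;
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    double sendval = (double)(100 * rank + tid);
    double recvval = 0.0;
    MPI_Request reqs[2];
    MPI_Isend(&sendval, 1, MPI_DOUBLE, partner, tid, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvval, 1, MPI_DOUBLE, partner, tid, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    // recvval now holds the value sent by the partner rank's thread tid.
}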

Memory Layout and Affinity in Hybrid Codes

Because MPI processes do not share memory, while OpenMP threads do, you need to pay attention to where memory is allocated and which threads access it.

In many operating systems memory pages are physically allocated on first touch. This means the thread that first writes to a memory location determines which NUMA region of the node the memory is allocated on. For hybrid codes this can significantly affect performance.

A typical approach is to perform initialization in a parallel region such that threads allocate and initialize the portions they will later compute on. This matches memory placement with subsequent access patterns. For instance:

double *A = malloc(N * sizeof(double));
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    A[i] = 0.0; // first touch by the thread that will later use A[i]
}

Affinity and binding settings control which cores each MPI process and each OpenMP thread use. In job scripts and environment variables, you usually specify how many MPI tasks per node and how many OpenMP threads per task, and you may request that threads are pinned to cores. The exact options depend on the scheduler and MPI implementation, but the idea is to avoid cores being oversubscribed or threads migrating unnecessarily across cores.

For good performance you must keep a consistent mapping between MPI ranks, OpenMP threads, and hardware cores. Oversubscribing cores, that is, placing more processes or threads on a core than it can serve, usually hurts both performance and scaling.

Load Balancing in Hybrid Contexts

Load balancing becomes more layered in hybrid codes. At the MPI level you want similar work per process. At the OpenMP level you want similar work per thread within each process.

If you use a regular domain decomposition, and the work per grid point is similar, static partitioning at both levels often works. For example, each MPI rank owns an equal number of rows, and each OpenMP loop uses schedule(static) over the local rows.

For irregular problems or adaptive meshes, however, some MPI ranks may own more expensive parts of the domain, and within a rank some threads may have more work than others. In such cases you may combine coarse adjustment of MPI decomposition with OpenMP dynamic scheduling or more advanced approaches like work stealing or load balancing libraries. For beginners, the simplest step is often to keep the MPI decomposition as regular as possible and use OpenMP schedule(dynamic) on the innermost loops that show high variability in work.
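
As a sketch, assuming a local loop over rows whose cost varies strongly from row to row (local_start, local_end, and process_row are placeholder names, not taken from the code above):

// Dynamic scheduling hands out small chunks of iterations on demand,
// which evens out the threads' work when some rows are much more expensive.
#pragma omp parallel for schedule(dynamic, 4)
for (int i = local_start; i < local_end; i++) {
    process_row(i); // placeholder for work whose cost varies with i
}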

You also need to remember that MPI communication patterns depend on your domain decomposition. Changing the MPI level decomposition to improve balance might change the communication volume and pattern.

Synchronization and Correctness

Hybrid codes introduce multiple forms of synchronization. MPI calls such as collective communications, blocking sends, and receives synchronize processes. OpenMP constructs such as barrier, critical, and atomic synchronize threads.

The combined effect can be surprising. For example, if one MPI rank reaches a collective operation early while another rank is delayed because its threads are still working toward an OpenMP barrier, the early rank sits idle until the late one arrives. Correctness is usually maintained by having a clear structure where:

MPI communications are called from well defined points in the program, usually outside OpenMP parallel regions or from a single thread.

OpenMP barriers and synchronization do not conflict with MPI message ordering.

In practice beginners often follow two simple rules: only the master thread calls MPI, and all threads in a process reach the same sequence of parallel regions and barriers. This reduces the risk of deadlocks and inconsistent program states.

Never allow one OpenMP thread to enter a blocking MPI call while another thread in the same process waits for that thread at a barrier. This combination can deadlock the process.

Choosing MPI Process and OpenMP Thread Counts

For a given machine with a fixed number of cores per node, you must choose how many MPI processes and how many OpenMP threads to use. This choice can affect memory usage, communication volume, and cache behavior.

Using more MPI ranks per node means more duplicated data structures, more MPI communication endpoints, and potentially more messages. It can, however, reduce the need for OpenMP synchronization and may be simpler to reason about.

Using fewer MPI ranks and more threads per rank reduces data duplication and can reduce MPI communication on-node, but may introduce more contention for shared data and require careful OpenMP tuning.

A typical starting point is to use one MPI process per NUMA region or per socket, and then use OpenMP threads to cover all cores in that region. For example, on a dual socket node with 32 cores per socket, you might run 2 MPI tasks per node and 32 OpenMP threads per task, for a total of 64 cores.

You usually control this through command line options to mpirun or srun and environment variables such as OMP_NUM_THREADS. The specific syntax depends on the job scheduler and MPI library.

Experimentation is often necessary. You can measure time to solution and memory usage for different combinations of processes and threads, and then choose the configuration that balances speed and resource usage for your particular code and problem size.

Example Design Process for a Hybrid Code

When turning a pure MPI code or a pure OpenMP code into a hybrid one, a simple sequence of design steps can help:

First, identify the natural MPI level decomposition. This is usually the same as in the existing pure MPI version. You decide how many subdomains, which rank owns which part, and what communication is needed.

Second, examine the hotspots inside each MPI process. Find loops or regions that can be parallelized with OpenMP. Introduce OpenMP parallel for or similar constructs around these regions while keeping the MPI interface unchanged.

Third, decide on MPI thread support. If only the master thread will call MPI, request MPI_THREAD_FUNNELED and make sure you use #pragma omp master or #pragma omp single around MPI calls.

Fourth, tune the number of MPI processes per node and OpenMP threads per process. Use the job scheduler options to test different combinations, and observe performance and memory usage.

Finally, adjust scheduling, affinity, and any NUMA related initialization. Make sure that data is initialized with the same parallel pattern in which it will later be used, and that threads and processes are bound to appropriate cores.

By following this process you can incrementally move from a simple single model parallel program to a hybrid MPI OpenMP program that uses both node level and cluster level resources more efficiently.
