Why Combine MPI and OpenMP?
Hybrid programming means using MPI between nodes and OpenMP within nodes (or sockets). The motivation was introduced in the parent chapter; here we focus on how the combination actually works in practice and what is specific to running both together.
Typical goals:
- Use cores/threads inside a node efficiently (OpenMP).
- Use multiple nodes across the cluster (MPI).
- Reduce MPI communication overhead and memory footprint.
- Adapt to different node architectures (core counts, NUMA domains).
A common mental model:
- MPI: “which node / process?”
- OpenMP: “which core / thread inside that process?”
Typical Hybrid Execution Models
One MPI Process per Node
- Layout: 1 MPI process on each node, many OpenMP threads inside it.
- Pros:
- Few MPI ranks → less MPI metadata, fewer messages.
- Potentially simpler communication pattern (one rank per node).
- Cons:
- One process must handle all NUMA domains, sockets, GPUs, etc.
- Thread placement and memory locality are critical and can be tricky.
- Some MPI implementations scale worse with very large thread counts.
This model is often used for memory-bound codes where data is easily shared by all threads on a node.
One MPI Process per Socket (or NUMA Domain)
- Layout: 2 MPI processes on a dual-socket node, each with its own subset of cores/threads.
- Pros:
- Better NUMA locality: each MPI process primarily uses its local memory.
- Fewer threads per process → less OpenMP scheduling overhead.
- Often a good balance between memory locality and MPI rank count.
- Cons:
- More MPI ranks than “one per node,” so more collectives/messages.
This is a very common and robust baseline configuration.
One MPI Process per Core (or Hardware Thread)
- Layout: MPI-only or nearly MPI-only: 1 process per core, with little or no use of OpenMP threads.
- In a hybrid context: sometimes used as a starting point, then selected parts of the code are threaded.
- Pros:
- Simple MPI mental model.
- Cons:
- No benefit from hybrid; not really “using” OpenMP.
- Potentially huge MPI communicator sizes and communication overhead.
Hybrid designs usually move away from this extreme unless the code is hard to thread.
Choosing a Model
Key factors:
- Node architecture: number of sockets, cores per socket, NUMA layout, SMT/hyperthreads.
- Code characteristics: communication pattern, memory bandwidth needs, thread scalability.
- MPI implementation: overhead with many ranks vs many threads.
You typically experiment with:
`n_ranks_per_node × n_threads_per_rank = total_cores_per_node`
and measure performance.
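One way to sanity-check a chosen configuration at run time is to have each node report its rank × thread product and compare it against the physical core count. The sketch below is one possible way to do this (not tied to any particular application): it uses `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED` to count the ranks sharing a node and `omp_get_max_threads()` for the threads per rank.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    // Ranks that share a node end up in the same sub-communicator
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int ranks_per_node, node_rank;
    MPI_Comm_size(node_comm, &ranks_per_node);
    MPI_Comm_rank(node_comm, &node_rank);

    int threads_per_rank = omp_get_max_threads();

    // One report per node: compare the product to the node's core count
    if (node_rank == 0) {
        printf("ranks/node = %d, threads/rank = %d, product = %d\n",
               ranks_per_node, threads_per_rank,
               ranks_per_node * threads_per_rank);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```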
Basic Structure of a Hybrid MPI + OpenMP Program
A typical C / C++ skeleton (Fortran is analogous):
```c
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    // Always check the threading level actually provided
    if (provided < MPI_THREAD_FUNNELED) {
        // handle error or fall back to another mode
    }

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Optionally set number of threads (else use OMP_NUM_THREADS)
    // omp_set_num_threads(4);

    // Serial + MPI region
    // ...

    // Parallel region with OpenMP
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        // Example: only one thread calls MPI
        #pragma omp master
        {
            // MPI call from the master thread
            // MPI_Send / MPI_Recv / MPI_Bcast / ...
        }

        // OpenMP-only computation here
        // ...
    }

    MPI_Finalize();
    return 0;
}
```

Key points that are specific to hybrid programming:
- `MPI_Init_thread` instead of `MPI_Init`.
- Use of OpenMP constructs inside an MPI program.
- Clear decisions about which threads are allowed to call MPI.
MPI Threading Levels and Their Impact
When combining MPI and OpenMP, you must tell MPI what level of thread support you need. This is done via `MPI_Init_thread(&argc, &argv, required, &provided)`. The `required` and `provided` values are integers representing one of the following thread-support levels:
- `MPI_THREAD_SINGLE`: Only one thread exists; no OpenMP (or OpenMP with only one thread). Not hybrid in practice.
- `MPI_THREAD_FUNNELED`: Multiple threads may exist, but only the thread that called `MPI_Init_thread` is allowed to make MPI calls. This is usually the master thread (OpenMP thread 0). Lower overhead than full multithreading; often sufficient.
- `MPI_THREAD_SERIALIZED`: Multiple threads may call MPI, but not at the same time; the program must ensure the calls are serialized, e.g., via `#pragma omp critical`. More flexible than `FUNNELED`, still less demanding than `MULTIPLE`.
- `MPI_THREAD_MULTIPLE`: Any thread can call MPI at any time. Most flexible, but may incur higher overhead and complexity in the MPI library.
Hybrid codes often choose the lowest level that satisfies their needs, to minimize MPI overhead. Many applications work well with `MPI_THREAD_FUNNELED`:
- Master thread: communication + some work.
- Worker threads: computation only.
Basic Collaboration Pattern: MPI Outside, OpenMP Inside
A very common hybrid pattern:
- MPI decides global data decomposition (which process works on which subset).
- Each process works on its local subset using OpenMP to spread work across threads.
- MPI communicates boundary data, global reductions, etc.
Conceptually:
- Between nodes: MPI handles domain decomposition and halo exchanges.
- Within a node: OpenMP parallelizes loops over local data.
Simple example structure (pseudocode):
```c
MPI_Init_thread(...);

// Decide local domain, based on rank and world_size

for (time_step = 0; time_step < T; ++time_step) {
    // COMPUTE: thread-parallel region
    #pragma omp parallel for
    for (i = local_start; i < local_end; ++i) {
        // update local data[i]
    }

    // COMMUNICATE: halo exchange with neighboring ranks
    MPI_Sendrecv(...);
    // or MPI_Irecv / MPI_Isend + MPI_Wait...
}

MPI_Finalize();
```
MPI is used mostly in the serial sections; OpenMP is used inside the compute-heavy regions. This pattern aligns well with `MPI_THREAD_FUNNELED`.
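As a concrete (if simplified) illustration of this structure, the sketch below implements a 1D stencil with one halo cell on each side and periodic neighbors. The local size `N_LOCAL`, the number of steps, and the averaging update are illustrative assumptions, not taken from any particular application.

```c
#include <mpi.h>

#define N_LOCAL 1000   // hypothetical local domain size
#define N_STEPS 100

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Local data with one halo cell on each side: indices 0 and N_LOCAL + 1
    double u[N_LOCAL + 2], u_new[N_LOCAL + 2];
    for (int i = 0; i < N_LOCAL + 2; ++i) u[i] = (double)rank;

    int left  = (rank - 1 + size) % size;   // periodic neighbors
    int right = (rank + 1) % size;

    for (int step = 0; step < N_STEPS; ++step) {
        // COMMUNICATE: exchange halo cells (serial part of the time step)
        MPI_Sendrecv(&u[N_LOCAL], 1, MPI_DOUBLE, right, 0,
                     &u[0],       1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  1,
                     &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // COMPUTE: thread-parallel stencil update over the local cells
        #pragma omp parallel for
        for (int i = 1; i <= N_LOCAL; ++i)
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

        #pragma omp parallel for
        for (int i = 1; i <= N_LOCAL; ++i)
            u[i] = u_new[i];
    }

    MPI_Finalize();
    return 0;
}
```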
Managing MPI Calls in OpenMP Regions
When OpenMP parallel regions and MPI calls must coexist, you need to organize who talks to MPI.
Master-Only Communication (FUNNELED)
Used with `MPI_THREAD_FUNNELED`:

```c
#pragma omp parallel
{
    // Master thread handles MPI
    #pragma omp master
    {
        MPI_Recv(...);
        MPI_Send(...);
    }

    // There is no implicit barrier after `master`, so synchronize explicitly
    // before the other threads use the received data
    #pragma omp barrier

    // All threads use the newly received data
    #pragma omp for
    for (int i = 0; i < N; ++i) {
        // compute with data
    }
}
```

- Only thread 0 (or whichever thread entered `MPI_Init_thread`) performs MPI calls.
- Requires synchronization so that all threads see consistent data.
Serialized MPI Calls (SERIALIZED)
Used with `MPI_THREAD_SERIALIZED`:

```c
#pragma omp parallel
{
    // Some threads might need to communicate
    if (need_to_communicate) {
        #pragma omp critical(mpi_comm)
        {
            MPI_Send(...);
            MPI_Recv(...);
        }
    }

    // Other work, possibly in parallel
}
```

- More flexible, but it introduces `critical` regions that can become bottlenecks.
- Good when occasional MPI calls from different threads are convenient.
Fully Multithreaded Communication (MULTIPLE)
Used with `MPI_THREAD_MULTIPLE`:

```c
MPI_Init_thread(..., MPI_THREAD_MULTIPLE, &provided);

#pragma omp parallel
{
    // In principle, any thread can call MPI without special restrictions
    if (condition) {
        MPI_Isend(...);
        MPI_Irecv(...);
    }
}
```

- Requires an MPI implementation with good `MPI_THREAD_MULTIPLE` support.
- Typically used only when there is a clear need, e.g., overlapping communication and computation in complex patterns.
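As one illustration of what `MPI_THREAD_MULTIPLE` allows, the sketch below lets every OpenMP thread perform its own nonblocking exchange with a partner rank, using the thread id as the message tag to keep the messages apart. The pairing scheme (`rank ^ 1`) and the message contents are arbitrary choices for this example; it assumes at least two ranks and a library that genuinely provides `MPI_THREAD_MULTIPLE`.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);   // this sketch requires MULTIPLE

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int partner = rank ^ 1;             // pair ranks: 0<->1, 2<->3, ...

    if (partner < size) {
        // Every thread exchanges one message with the matching thread id
        // on the partner rank, concurrently with the other threads
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int send_val = rank * 100 + tid, recv_val = -1;
            MPI_Request reqs[2];

            MPI_Isend(&send_val, 1, MPI_INT, partner, tid,
                      MPI_COMM_WORLD, &reqs[0]);
            MPI_Irecv(&recv_val, 1, MPI_INT, partner, tid,
                      MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

            printf("rank %d thread %d received %d\n", rank, tid, recv_val);
        }
    }

    MPI_Finalize();
    return 0;
}
```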
Process and Thread Affinity in Hybrid Runs
Hybrid performance is very sensitive to where processes and threads run.
Key aspects:
- MPI rank placement:
- Often one rank per socket or per NUMA domain.
- Use job scheduler options or MPI launcher flags to pin ranks to specific cores or sockets.
- OpenMP thread placement:
- Control with environment variables such as `OMP_PROC_BIND`, `OMP_PLACES`, and the number of threads (`OMP_NUM_THREADS`).
- Ensure the threads of a rank stay on “its” cores and do not migrate across sockets (to avoid NUMA penalties).
Typical environment configuration:
```bash
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores
```

Launch example (SLURM-like, just conceptual):

```bash
srun --ntasks-per-node=2 --cpus-per-task=8 ./my_hybrid_app
```

- `--ntasks-per-node=2`: 2 MPI ranks per node (e.g., one per socket).
- `--cpus-per-task=8`: each MPI rank gets 8 CPUs → OpenMP uses 8 threads.
This coordination between scheduler, MPI, and OpenMP is central to a successful hybrid configuration.
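To verify that the pinning actually took effect, it helps to print where each thread of each rank ends up running. The sketch below does this with `sched_getcpu()`, which is Linux/glibc-specific; on other systems you would substitute an equivalent query.

```c
#define _GNU_SOURCE
#include <sched.h>      // sched_getcpu(), Linux/glibc-specific
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    // Each thread reports the CPU it is currently running on; compare the
    // output against the intended rank/thread layout
    #pragma omp parallel
    {
        printf("host %s rank %d thread %d on cpu %d\n",
               host, rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}
```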
Memory Usage Considerations in Hybrid Codes
Combining MPI and OpenMP changes memory behavior:
- Fewer MPI ranks → fewer independent copies of large data structures.
- Example: global lookup tables or mesh metadata can be shared by all threads in a process.
- This can significantly reduce memory usage on large-scale runs.
- NUMA locality:
- With multiple threads per process, you must ensure threads allocate and access memory close to their local NUMA node.
- This can be influenced by:
- First-touch allocation behavior (which thread first writes to a region); see the sketch at the end of this section.
- Thread affinity and work partitioning between sockets.
- Per-rank buffers and MPI overhead:
- MPI often allocates internal buffers per rank.
- Reducing rank count can reduce MPI memory overhead, at the cost of more shared-memory contention within each rank.
Hybrid designs often arise from memory pressure: MPI-only runs may not fit in memory when the number of ranks is large.
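The first-touch behavior mentioned above can be illustrated with a small sketch: initialize a large array in a threaded loop using the same static schedule as the later compute loop, so each page is first written, and therefore placed, near the thread that will work on it. This assumes thread affinity has already been fixed via `OMP_PROC_BIND`/`OMP_PLACES`; the function and array names are hypothetical.

```c
#include <stdlib.h>

// Initialize and use a large array with the same static schedule, so each
// page is first touched (and therefore placed) by the thread that will
// later work on it.
void first_touch_example(long n) {
    double *a = malloc((size_t)n * sizeof *a);   // pages are not placed yet

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = 0.0;                              // first touch places each page

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = 2.0 * a[i] + 1.0;                 // mostly NUMA-local accesses

    free(a);
}
```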
Algorithm and Decomposition Choices in a Hybrid Setting
Combining MPI and OpenMP can change how you decompose the problem:
- Coarser MPI domains:
- With threads handling fine-grained parallelism, MPI domains can be larger.
- This may:
- Reduce surface-to-volume ratio → fewer halo cells or less boundary communication.
- Change load-balancing strategies: MPI balancing is done on larger chunks.
- Intra-domain parallelism:
- Within each domain, OpenMP parallelizes loops, tiles, or tasks.
- This reduces MPI-level granularity and can simplify communication patterns.
- Hybrid-aware load balancing:
- You might balance work not only per rank but also considering how many threads each rank has (one approach is sketched below).
- Some ranks (e.g., those on nodes with different performance characteristics) may have different thread counts.
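One simple way to make MPI-level balancing thread-aware is sketched below: each rank shares its `omp_get_max_threads()` value with `MPI_Allgather` and takes a slice of the global work range proportional to its thread count. The function name is made up for this example, and it assumes work items have roughly uniform cost.

```c
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

// Split `total_work` items across ranks in proportion to each rank's
// thread count.
void thread_weighted_range(long total_work, MPI_Comm comm,
                           long *my_start, long *my_count) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int my_threads = omp_get_max_threads();
    int *threads = malloc(size * sizeof *threads);
    MPI_Allgather(&my_threads, 1, MPI_INT, threads, 1, MPI_INT, comm);

    long total_threads = 0, before_me = 0;
    for (int r = 0; r < size; ++r) {
        total_threads += threads[r];
        if (r < rank) before_me += threads[r];
    }

    // Proportional prefix split: this rank owns [start, end)
    long start = total_work * before_me / total_threads;
    long end   = total_work * (before_me + my_threads) / total_threads;

    *my_start = start;
    *my_count = end - start;
    free(threads);
}
```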
Common Pitfalls Specific to Combining MPI and OpenMP
Oversubscribing Cores
- Running with `n_ranks_per_node × threads_per_rank > physical_cores_per_node` leads to oversubscription.
- Symptoms:
- Lower performance than expected.
- High context-switch overhead.
- Prevention:
- Plan total concurrency carefully with scheduler and environment variables.
Inconsistent Threading Configuration
- Using different `OMP_NUM_THREADS` values in different runs without adjusting `--cpus-per-task`.
- Forgetting to set thread affinity, which lets threads wander across cores and sockets.
Incorrect MPI Thread Level Assumptions
- Requesting `MPI_THREAD_MULTIPLE` but only getting `MPI_THREAD_SERIALIZED`, and still allowing multiple threads to call MPI concurrently → undefined behavior.
- Solution:
- Always check `provided`.
- Adjust the program logic if `provided < required` (see the sketch below).
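A minimal sketch of that check is shown below: request `MPI_THREAD_MULTIPLE`, inspect `provided`, and record in an application-level flag (the hypothetical `use_threaded_mpi`) whether worker threads are allowed to call MPI at all.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    // Hypothetical application-level switch: only let worker threads call
    // MPI if the library really provides MPI_THREAD_MULTIPLE
    int use_threaded_mpi = (provided >= MPI_THREAD_MULTIPLE);

    if (!use_threaded_mpi && provided < MPI_THREAD_FUNNELED) {
        // Not even funneled support: stop rather than risk undefined behavior
        fprintf(stderr, "insufficient MPI thread support: %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // ... later: if use_threaded_mpi is 0, funnel all MPI calls through the
    // master thread instead of calling MPI from worker threads ...

    MPI_Finalize();
    return 0;
}
```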
Hidden Global Synchronizations
- Mixing OpenMP barriers, MPI collectives, and point-to-point calls can create unexpected global synchronization patterns.
- Example:
- Within an OpenMP parallel region, all threads wait at a barrier before the master thread performs an MPI collective, which in turn waits for all ranks. This can amplify load imbalance.
Debugging Complexity
- Bugs can stem from both MPI-level and OpenMP-level issues:
- Data races on shared data plus mismatched MPI messages.
- Strategy:
- First debug MPI-only and OpenMP-only versions where possible.
- Introduce hybridization step-by-step.
Incremental Path to Hybridization
For existing MPI codes:
- Identify compute-heavy loops inside each MPI process.
- Add OpenMP pragmas to those loops, keeping MPI calls outside of parallel regions.
- Start with `MPI_THREAD_FUNNELED` and a small number of threads.
- Set a simple rank layout (e.g., 1 rank per socket) and an appropriate `OMP_NUM_THREADS`.
- Measure performance and memory usage; adjust the rank/thread configuration.
- Only if needed, experiment with more advanced patterns (e.g., overlapping communication and computation, `MPI_THREAD_MULTIPLE`, or more complicated OpenMP constructs); one such overlap pattern is sketched below.
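As a sketch of the "only if needed" step above, the function below overlaps the halo exchange with the interior update while staying within `MPI_THREAD_FUNNELED`, because all MPI calls remain outside the OpenMP regions. The array and variable names (`u`, `u_new`, `local_n`) are hypothetical.

```c
#include <mpi.h>

// Overlap halo communication with the interior update; boundary points are
// finished only after the halos have arrived.
void overlapped_step(double *u, double *u_new, int local_n,
                     int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];

    // Start the halo exchange without waiting for it
    MPI_Irecv(&u[0],           1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(&u[local_n + 1], 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(&u[1],           1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(&u[local_n],     1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    // Interior points do not depend on the halos: compute them while the
    // messages are in flight
    #pragma omp parallel for
    for (int i = 2; i <= local_n - 1; ++i)
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

    // Complete the communication, then finish the two boundary points
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    u_new[1]       = 0.5 * (u[0] + u[2]);
    u_new[local_n] = 0.5 * (u[local_n - 1] + u[local_n + 1]);
}
```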
For existing OpenMP codes:
- Add MPI-based domain decomposition at the outer level.
- Keep most OpenMP parallelism inside the per-domain computations.
- Start with small numbers of ranks and nodes, then scale out.
Summary of Key Design Decisions in Hybrid MPI + OpenMP
When combining MPI and OpenMP, you must explicitly decide:
- MPI rank layout per node (per node, per socket, per NUMA domain).
- Number of threads per rank and their placement.
- MPI threading level (`FUNNELED`, `SERIALIZED`, or `MULTIPLE`) and how MPI calls are organized relative to OpenMP regions.
- Decomposition strategy: which parallelism is expressed at the MPI level vs. the OpenMP level.
- Performance tuning knobs: rank/thread count, affinity, scheduling, and communication patterns.
These decisions differentiate a simple “MPI plus threads” program from a well-designed hybrid HPC application.