Overview
In hybrid parallel programming you intentionally combine at least two forms of parallelism in a single application. In this course the most common combination is MPI for distributed memory parallelism across nodes, and OpenMP for shared memory parallelism within each node. This chapter focuses on typical structural patterns that appear again and again in real hybrid codes. You will see how these patterns decompose work across nodes and within nodes, and how data is usually moved and synchronized.
The goal here is not to provide full MPI or OpenMP tutorials, but to show how they are composed in practice.
Single-program, multi-level parallelism
The dominant pattern in hybrid HPC applications is a single program that uses MPI between processes and threaded parallelism inside each process. Every MPI process runs the same executable. Each process gets a global rank, and often a rank within its node, and then starts one or more parallel regions with threads.
A very common structure looks like this:
- MPI initializes and discovers the global communicator.
- Each rank determines which node it is on and how many ranks share that node.
- Inside the main computation loop, each rank launches OpenMP parallel regions on its share of the node's cores.
- Global communication, such as reductions or halo exchanges, is done by MPI between parallel regions or sometimes inside them with care.
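The node-level discovery in the list above is usually done with MPI_Comm_split_type from MPI-3, which groups the ranks that share a node into a node-local communicator. A minimal sketch, with illustrative variable names:

MPI_Comm node_comm;
int node_rank, node_size;
// Group all ranks that can share memory, i.e. the ranks on this node
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
MPI_Comm_rank(node_comm, &node_rank);   // rank within the node
MPI_Comm_size(node_comm, &node_size);   // number of ranks sharing the node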
A simplified pattern in C-like pseudocode is:
MPI_Init(...);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
setup_problem();
for (step = 0; step < nsteps; ++step) {
    exchange_boundaries_with_MPI();
    #pragma omp parallel
    {
        compute_local_update();
    }
    if (step % output_interval == 0) {
        write_output_with_MPI();
    }
}
MPI_Finalize();

The key idea is that coarse decomposition across nodes is handled by MPI, and fine-grain work inside those decomposed pieces is handled by threads.
In most hybrid patterns you should avoid oversubscribing CPU cores. The number of MPI processes per node multiplied by the number of threads per process should not exceed the available hardware threads on that node.
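As a sanity check, a rank can compare the product of ranks per node and threads per rank against the hardware thread count. The sketch below reuses node_size and node_rank from the node communicator shown earlier and queries the core count with the POSIX call sysconf from <unistd.h>; the exact query and the warning message are only an illustration:

long hw_threads = sysconf(_SC_NPROCESSORS_ONLN);   // hardware threads visible on this node
int threads_per_rank = omp_get_max_threads();
if (node_rank == 0 && (long)node_size * threads_per_rank > hw_threads) {
    fprintf(stderr, "Warning: node oversubscribed: %d ranks x %d threads > %ld hardware threads\n",
            node_size, threads_per_rank, hw_threads);
}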
Domain decomposition with threaded subdomains
A very common pattern is spatial domain decomposition. The global physical or logical domain is divided into subdomains, each assigned to one MPI rank. Inside each subdomain, OpenMP threads operate on different parts of the local data.
Imagine a two-dimensional grid of size $N_x \times N_y$. A typical hybrid pattern follows these steps.
First, divide the global grid into $P_x \times P_y$ MPI subdomains. Each MPI process gets a block of size $n_x \times n_y$, often including halo or ghost cells at the boundaries. The decomposition across processes is often done with a Cartesian communicator or by computing indices from the rank number.
Second, within each local $n_x \times n_y$ block, use OpenMP to parallelize loops over rows or columns. Each thread processes a subset of the local indices $i, j$.
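A sketch of the process-level part of these two steps, using MPI's Cartesian topology routines; Nx and Ny are the global grid sizes from above, and remainder rows, columns, and periodic boundaries are ignored for brevity:

int dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
MPI_Comm cart_comm;
MPI_Dims_create(size, 2, dims);                       // choose a balanced Px x Py factorization
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);
MPI_Cart_coords(cart_comm, rank, 2, coords);          // this rank's (px, py) position
int nx_local = Nx / dims[0];                          // local block size, remainders ignored
int ny_local = Ny / dims[1];
int west, east, south, north;                         // neighbor ranks for halo exchange
MPI_Cart_shift(cart_comm, 0, 1, &west, &east);
MPI_Cart_shift(cart_comm, 1, 1, &south, &north);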
A typical loop nest pattern is:
for (t = 0; t < timesteps; ++t) {
    // Exchange halos for subdomains across MPI processes
    halo_exchange(MPI_COMM_WORLD, u_local);
    // Update interior with OpenMP
    #pragma omp parallel for collapse(2)
    for (j = 1; j < ny_local-1; ++j) {
        for (i = 1; i < nx_local-1; ++i) {
            u_new[j][i] = stencil_update(u_local, i, j);
        }
    }
    swap(u_local, u_new);
}

The hybrid pattern separates concerns. MPI handles communication between subdomains, while OpenMP distributes the compute work in each subdomain over the cores of a node.
This domain decomposition pattern appears in finite difference and finite volume solvers, structured grid PDE codes, lattice-based simulations, and many others.
Master-only and thread-safe MPI patterns
Hybrid codes must decide how MPI calls interact with threads. Two recurring patterns are common in practice.
In the master-only pattern, also called "MPI on the outside," only one thread in each MPI process performs all MPI communication. Typically this is the OpenMP master thread, or MPI is simply called outside of parallel regions altogether. OpenMP parallel regions are placed around purely local computations and are closed before any MPI calls.
The outer structure then resembles:
MPI_Init_thread(..., MPI_THREAD_FUNNELED, ...);
for (step = 0; step < nsteps; ++step) {
    // MPI communication happens here, executed by the master thread only,
    // outside any parallel region
    halo_exchange_using_MPI(u_local);
    #pragma omp parallel
    {
        compute_step(u_local);
    }
}
MPI_Finalize();
In this pattern the MPI thread support level MPI_THREAD_FUNNELED is sufficient, meaning that only the thread that called MPI_Init_thread (the master thread) makes MPI calls. This simplifies thread safety but requires careful placement of parallel regions so that worker threads never call MPI.
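Whatever level a hybrid code requests, it should check what the library actually provided, since MPI_Init_thread may return a lower level than requested. A minimal check, assuming it sits at the start of main:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
if (provided < MPI_THREAD_FUNNELED) {
    fprintf(stderr, "MPI library does not provide MPI_THREAD_FUNNELED\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
}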
A different pattern is thread-safe MPI, sometimes called "MPI everywhere." Threads may call MPI routines concurrently, for example when overlapping communication and computation, or when each thread owns part of the data that is communicated. Here MPI_Init_thread must request MPI_THREAD_MULTIPLE.
A typical pattern might look like:
MPI_Init_thread(..., MPI_THREAD_MULTIPLE, &provided);
// Concurrent collectives must not share a communicator, so give each thread
// its own duplicate; this assumes every rank runs the same number of threads
int nthreads = omp_get_max_threads();
MPI_Comm comm[nthreads];
for (int t = 0; t < nthreads; ++t) {
    MPI_Comm_dup(MPI_COMM_WORLD, &comm[t]);
}
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    // Each thread prepares its own buffer portion
    prepare_thread_buffer(tid, buf[tid]);
    // All threads may participate in MPI communication, each on its own communicator
    MPI_Allreduce(MPI_IN_PLACE, buf[tid], count, MPI_DOUBLE, MPI_SUM, comm[tid]);
    // Each thread uses the reduced data
    use_reduced_result(tid, buf[tid]);
}
MPI_Finalize();

This "MPI in the threads" pattern allows more flexibility but often increases complexity and can be limited by how well the MPI library performs with MPI_THREAD_MULTIPLE. Many production codes stick to master-only MPI and prefer a clear separation between communication phases and threaded compute phases.
Hybrid patterns in stencil and PDE solvers
Structured mesh and stencil-based PDE applications are classical users of hybrid programming. Several distinct hybrid patterns appear in such codes.
A first pattern uses MPI to divide the global grid into relatively coarse blocks, leaving each MPI rank with a substantial local problem. OpenMP then parallelizes the main update loops inside each rank. Domain boundaries are updated with MPI halo exchanges between time steps or iterations. OpenMP is not used to parallelize communications in this pattern.
A second pattern adds threading to communication packing and unpacking. For wide stencils or complex subdomain shapes, packing halo regions into contiguous MPI buffers can be expensive. A hybrid approach can assign different halo faces or segments to different threads. A typical structure is:
- Launch an OpenMP parallel region.
- Use omp for to parallelize packing of halo data into send buffers.
- Use a master-only region to perform MPI sends and receives.
- Use omp for again to unpack received halos.
- Use omp for to update the interior points.
This pattern uses threads both for computation and for communication preparation, while still performing MPI calls only in the master thread.
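A sketch of this structure for one time step, assuming hypothetical helpers pack_face and unpack_face, per-face send and receive buffers, and an array of neighbor ranks; the implicit barrier at the end of each omp for and the explicit barrier after the master block keep packing, communication, and unpacking in order:

#pragma omp parallel
{
    // Pack halo faces into send buffers; faces are independent, so threads share the work
    #pragma omp for
    for (int f = 0; f < n_faces; ++f) {
        pack_face(f, u_local, sendbuf[f]);
    }
    // Only the master thread talks to MPI, so MPI_THREAD_FUNNELED is enough
    #pragma omp master
    {
        for (int f = 0; f < n_faces; ++f) {
            MPI_Irecv(recvbuf[f], face_count[f], MPI_DOUBLE, neighbor[f], 0, MPI_COMM_WORLD, &reqs[2*f]);
            MPI_Isend(sendbuf[f], face_count[f], MPI_DOUBLE, neighbor[f], 0, MPI_COMM_WORLD, &reqs[2*f+1]);
        }
        MPI_Waitall(2 * n_faces, reqs, MPI_STATUSES_IGNORE);
    }
    #pragma omp barrier   // master has no implied barrier; wait until halos have arrived
    // Unpack received halos in parallel
    #pragma omp for
    for (int f = 0; f < n_faces; ++f) {
        unpack_face(f, recvbuf[f], u_local);
    }
    // Update the interior points
    #pragma omp for collapse(2)
    for (int j = 1; j < ny_local - 1; ++j) {
        for (int i = 1; i < nx_local - 1; ++i) {
            u_new[j][i] = stencil_update(u_local, i, j);
        }
    }
}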
A more advanced pattern overlaps communication and computation with both MPI and OpenMP. For example, one may split a local domain into interior and boundary regions. While MPI exchanges halos for boundary regions, threads compute updates for interior cells. This produces a timeline where communication and computation happen concurrently. The high level steps are:
- Start nonblocking MPI sends and receives for halo regions.
- In an OpenMP parallel region, execute computation on strictly interior cells that do not depend on new halo values.
- Wait for MPI communication to complete.
- In another parallel loop, update boundary cells that depend on halo data.
This pattern requires careful scheduling and correct use of nonblocking MPI, but can hide communication latency and is widely used in performance critical PDE solvers.
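A sketch of this overlap for the 2D stencil above, still master-only since every MPI call stays outside parallel regions; the boundary index lists bi and bj, the buffers, and the neighbor arrays are placeholders, and the interior loop bounds assume a stencil that reaches only one cell into the halo:

// Post nonblocking receives and sends for all halo faces first
for (int n = 0; n < n_neighbors; ++n) {
    MPI_Irecv(recvbuf[n], counts[n], MPI_DOUBLE, neighbor[n], 0, MPI_COMM_WORLD, &reqs[2*n]);
    MPI_Isend(sendbuf[n], counts[n], MPI_DOUBLE, neighbor[n], 0, MPI_COMM_WORLD, &reqs[2*n+1]);
}
// Update strictly interior cells, which do not need the incoming halo values
#pragma omp parallel for collapse(2)
for (int j = 2; j < ny_local - 2; ++j) {
    for (int i = 2; i < nx_local - 2; ++i) {
        u_new[j][i] = stencil_update(u_local, i, j);
    }
}
// Wait for the halos, then finish the cells next to the subdomain boundary
MPI_Waitall(2 * n_neighbors, reqs, MPI_STATUSES_IGNORE);
#pragma omp parallel for
for (int k = 0; k < n_boundary_cells; ++k) {
    u_new[bj[k]][bi[k]] = stencil_update(u_local, bi[k], bj[k]);
}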
Hybrid patterns for sparse linear algebra
Many large scale simulations reduce to one or more sparse linear algebra problems. Here, hybrid patterns focus on the representation of sparse matrices and vectors, and how to share them between processes and threads.
A common pattern uses an MPI process per subdomain or partition of the sparse matrix, and then an OpenMP parallel loop over rows of the local sparse matrix. The data structures often remain essentially the same as in pure MPI codes, such as compressed sparse row (CSR) format, but only one copy of the local matrix is stored per MPI process, and threads share it.
For example, a sparse matrix vector multiply in CSR format might be organized as:
// y = A * x, with A locally stored in CSR on each MPI rank
// First fetch the off-process (halo) entries of x referenced by the local rows
exchange_vector_halos_with_MPI(x);
#pragma omp parallel for
for (i = 0; i < n_local_rows; ++i) {
    double sum = 0.0;
    for (idx = rowptr[i]; idx < rowptr[i+1]; ++idx) {
        int col = colind[idx];
        sum += val[idx] * x[col];
    }
    y[i] = sum;
}

This "rows by threads, rows by processes" pattern is very common. Each MPI process owns a contiguous set of rows. Each OpenMP thread processes a subset of the local rows. Communication of off-process vector entries is still handled with MPI.
Other hybrid sparse patterns assign different matrix blocks to threads or use task-based OpenMP constructs to represent independent matrix blocks. In modern algebra libraries, one frequently sees a mixture of MPI at the coarsest level and a thread pool or OpenMP inside each process, often complemented by vectorization within each thread. The hybrid pattern becomes three levels of parallelism: MPI across nodes, threads across cores, and SIMD across lanes.
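A sketch of those three levels for the CSR multiply above: MPI ranks already own row blocks, the outer pragma shares the local rows among threads, and omp simd asks the compiler to vectorize each row's dot product. In practice the indirect access through colind limits how much SIMD actually helps, so this is an illustration of the structure rather than a tuned kernel:

#pragma omp parallel for schedule(static)
for (int i = 0; i < n_local_rows; ++i) {
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (int idx = rowptr[i]; idx < rowptr[i+1]; ++idx) {
        sum += val[idx] * x[colind[idx]];
    }
    y[i] = sum;
}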
Hybrid patterns for particle and N-body simulations
Particle-based and N-body codes use different hybrid structures compared to regular grids, because the data and work are not naturally laid out on arrays of uniform size.
A common hybrid pattern divides physical space into cells or domains across MPI ranks. Each rank owns the particles within its space region. Inside each rank, threads share the particle arrays and operate on portions of them in parallel.
A typical pattern is:
- MPI assigns a spatial domain to each process and handles particle migration across domain boundaries.
- OpenMP parallelizes loops over local particles, for example during force calculation or position updates.
- Data structures like neighbor lists or cell lists are sometimes shared between threads and require careful synchronization.
For example, a hybrid time step might look like this:
for (step = 0; step < nsteps; ++step) {
    // Build or update domain decomposition with MPI
    exchange_particles_across_processes();
    // Compute forces in parallel on local particles
    #pragma omp parallel for
    for (i = 0; i < n_local_particles; ++i) {
        forces[i] = compute_force_for_particle(i, ...);
    }
    // Integrate equations of motion
    #pragma omp parallel for
    for (i = 0; i < n_local_particles; ++i) {
        update_particle_state(i, forces);
    }
}

Another hybrid pattern in N-body simulations uses threads to handle different interaction lists or tree nodes, while MPI partitions the global tree across processes. For example, a tree-based gravitational code may let each MPI process store a branch of the global tree and assign tree traversal tasks to threads. Threads can independently traverse different parts of the local tree, while MPI is responsible for assembling global multipole information and exchanging partial solutions.
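A sketch of such task-based traversal within one rank; local_root, traverse_subtree, and the moments buffer are hypothetical names, and the taskwait ensures all local work is finished before the MPI call:

#pragma omp parallel
#pragma omp single
{
    // One task per child of the locally stored branch; idle threads pick up tasks
    for (int c = 0; c < local_root->n_children; ++c) {
        #pragma omp task firstprivate(c)
        traverse_subtree(local_root->children[c], particles, moments);
    }
    #pragma omp taskwait
}
// Combine the rank-local contributions (for example multipole moments) across processes
MPI_Allreduce(MPI_IN_PLACE, moments, n_moments, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);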
Hybrid particle patterns often need careful consideration of load balancing, because particle distributions can become nonuniform. MPI-based domain decomposition, possibly dynamic, handles coarse balance, while internal work sharing through OpenMP tries to even out local imbalance across cores.
Hybrid master-worker and pipeline patterns
Not all hybrid patterns are domain based. Master-worker and pipeline patterns also appear frequently, especially in multi-physics applications and in applications that couple several solvers.
In a master-worker hybrid pattern, each MPI rank or a subset of them acts as a "global worker," and threads inside each worker perform local sub-tasks. For example, consider a parameter sweep where each MPI process is responsible for a subset of parameter combinations. Each process then spawns multiple OpenMP threads to simulate its assigned parameter set concurrently. The global decomposition is across tasks, and the local decomposition is within each task.
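As a concrete sketch under simple assumptions (one independent simulation per parameter combination; run_simulation, params, and n_params are hypothetical), the global parameter list can be distributed round-robin across ranks and the local share across threads:

// rank and size come from MPI_Comm_rank / MPI_Comm_size on MPI_COMM_WORLD
#pragma omp parallel for schedule(dynamic)
for (int p = rank; p < n_params; p += size) {
    run_simulation(params[p]);   // each thread handles whole parameter combinations
}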
Another master-worker pattern appears when a few MPI processes act as control or I/O ranks. They coordinate workflow, reading input and writing output, while other ranks perform heavy computation. Within each computational rank, threads share the workload. MPI is used between master and workers, and OpenMP distributes workload inside each worker.
Pipeline patterns occur when a simulation has several consecutive stages that are themselves parallel. For example, stage A might preprocess data, stage B might perform a simulation, and stage C might perform analysis. A hybrid pipeline can assign different sets of MPI ranks to different stages, with threads accelerating each stage locally. Data flows between pipeline stages through MPI. Within a stage, OpenMP provides loop or task parallelism. This pattern is less common in classical single-physics solvers but appears in workflows, coupled codes, and some data analysis pipelines.
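One way to set this up is to split MPI_COMM_WORLD by stage, so that each stage has its own communicator for internal collectives while inter-stage data still flows through point-to-point messages on the world communicator; the per-stage rank counts below are placeholders:

// Stage 0 preprocesses, stage 1 simulates, stage 2 analyzes
int stage = (rank < n_ranks_A) ? 0 : (rank < n_ranks_A + n_ranks_B) ? 1 : 2;
MPI_Comm stage_comm;
MPI_Comm_split(MPI_COMM_WORLD, stage, rank, &stage_comm);
// Inside each stage, OpenMP parallel regions accelerate the local work as usual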
Hybrid patterns with process and thread affinity
A practical but crucial part of hybrid programming patterns is how processes and threads are mapped to hardware. Many hybrid codes follow a "one MPI process per NUMA domain" pattern, and threads are confined to cores within that domain.
A NUMA domain is a hardware region where memory access is relatively uniform. For example, a dual-socket node can have two NUMA domains. A common hybrid mapping places one MPI process on each socket, and lets each process spawn threads that run only on the cores of that socket. This respects the memory hierarchy and reduces cross-socket memory traffic.
The pattern roughly is:
- Launch the job with a specific number of MPI processes per node, usually equal to the number of sockets or NUMA domains on the node.
- Set the number of threads per process equal to the number of cores in the corresponding domain.
- Use OpenMP environment variables or runtime APIs to pin threads to specific cores.
At code level, the hybrid pattern does not change the loop or domain decomposition structures, but the mapping matters a lot for performance. The common pattern is to pair the MPI rank layout with the node topology and then align thread affinity with the local core layout. Although affinity and placement are discussed more in performance chapters, you should already recognize that most practical hybrid patterns assume a thoughtful and nonrandom mapping of ranks and threads to hardware.
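A small diagnostic program often helps when setting up such a mapping: each thread reports the core it currently runs on, which makes the effect of batch system options and OpenMP variables such as OMP_PLACES and OMP_PROC_BIND visible. This sketch uses the Linux-specific call sched_getcpu:

#define _GNU_SOURCE
#include <sched.h>    // sched_getcpu(), Linux-specific
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Each thread prints where it is running right now
    #pragma omp parallel
    printf("rank %d thread %d on core %d\n", rank, omp_get_thread_num(), sched_getcpu());
    MPI_Finalize();
    return 0;
}

Running it with, for example, one rank per socket and OMP_PROC_BIND=close, OMP_PLACES=cores quickly shows whether threads stay inside their intended NUMA domain.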
Hybrid patterns for I/O and checkpointing
Many hybrid codes use a mixture of MPI and thread-level I/O strategies. A recurring pattern is "MPI for global I/O, threads for local preprocessing."
For example, an application may use MPI-IO or a high-level library that builds on MPI, such as HDF5 or parallel NetCDF, for coordinated access to shared files. Within each MPI process, OpenMP is used to reorganize data into I/O friendly layouts or to compress and pack buffers before writing.
A typical pattern is:
- Threads in each rank collect and transform local data in parallel and fill a contiguous output buffer.
- The master thread in each rank participates in a collective MPI-IO operation that writes the buffer to disk.
- During restart or postprocessing, MPI reads large chunks of data collectively, and threads unpack or decompress the data in parallel.
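A sketch of the first two steps with MPI-IO, assuming every rank writes an equally sized block of doubles and that transform_for_output stands in for whatever per-item preprocessing the threads perform:

// Threads fill a contiguous, I/O friendly output buffer in parallel
#pragma omp parallel for
for (int i = 0; i < n_local_items; ++i) {
    outbuf[i] = transform_for_output(local_data[i]);
}
// The master thread of each rank then joins a collective write
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_Offset offset = (MPI_Offset)rank * n_local_items * sizeof(double);
MPI_File_write_at_all(fh, offset, outbuf, n_local_items, MPI_DOUBLE, MPI_STATUS_IGNORE);
MPI_File_close(&fh);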
In some workflows, I/O ranks are specialized MPI processes that handle all interaction with the file system. Worker ranks send data to these I/O ranks through MPI. Within each I/O rank, OpenMP threads manage buffering, compression, or even local file creation. This hybrid I/O pattern separates computation and I/O responsibilities but still uses threads to make file handling efficient.
Choosing and combining patterns
Real applications often use several hybrid patterns together.
A domain decomposition PDE solver may use MPI for subdomains, OpenMP for loop parallelism inside each subdomain, overlapping MPI communication and local computation, a NUMA aware mapping of processes and threads, and hybrid MPI plus OpenMP based I/O for checkpoints.
Similarly, a multi-physics code might use different hybrid patterns inside each physics component, and then add a pipeline pattern across components.
A simple way to think about hybrid pattern design is:
- Use MPI for the outermost decomposition of the problem into chunks that can live on separate nodes.
- Within each chunk, choose OpenMP patterns that match your data structures and loops, such as parallel for loops for arrays or tasks for irregular work.
- Align your process and thread layout with the node architecture.
- Decide where MPI calls should live, master-only or thread-multiple, and place OpenMP regions accordingly.
- If I/O or coupled workflows are important, consider hybrid I/O and pipeline arrangements.
Although the details vary by application domain, the same basic hybrid patterns appear across many codes. Recognizing them makes it easier to understand existing HPC software and to design your own scalable hybrid applications.