Motivation: Why Hybrid Parallel Programming?
Hybrid parallel programming combines multiple parallel models—most commonly MPI for distributed-memory parallelism and OpenMP or GPU programming models for shared-memory or accelerator parallelism—within a single application.
In modern HPC systems, each node typically has:
- Multiple NUMA regions (sockets) and many cores per socket
- Vector units on each core
- Often one or more GPUs or other accelerators
- A high-performance network connecting nodes
Using only a single parallel model (e.g., MPI everywhere) can leave parts of this hardware underutilized or create bottlenecks. Hybrid programming aims to:
- Exploit multiple levels of parallelism
  - Across nodes (distributed memory, often via MPI)
  - Within nodes (shared memory, often via OpenMP or threads)
  - Within cores (vectorization, handled mostly by the compiler and libraries)
  - On accelerators (CUDA, OpenACC, etc., when combined)
- Reduce communication overhead
  - Fewer MPI processes per node can mean fewer inter-process messages and smaller MPI metadata overhead.
  - Intra-node work sharing can happen via threads or GPU kernels instead of explicit message passing.
- Use memory more efficiently
  - Threads or GPU kernels on a node can share memory and data structures, potentially reducing duplication compared to a pure-MPI approach.
  - Fewer MPI ranks may reduce the memory used per rank (buffers, halo regions, ghost cells, etc.).
- Adapt to complex hardware topologies
  - Hybrid models can respect socket/NUMA boundaries, core counts, and GPU placement more naturally than a single flat model.
Hybrid parallel programming is especially common in:
- Large-scale PDE solvers and CFD codes
- Climate and weather models
- Molecular dynamics
- Linear algebra and eigenvalue solvers
- Applications designed to run on leadership-class supercomputers
Common Hybrid Combinations
The most common hybrid combinations in practice include:
- MPI + OpenMP
  - MPI between nodes, OpenMP within each node (or within each socket).
  - Often considered the “standard” hybrid model on CPU-only clusters.
- MPI + CUDA / OpenACC / HIP
  - MPI between nodes and between GPUs, with each MPI process controlling one or more GPUs.
  - GPU kernels handle the fine-grained parallelism.
- MPI + OpenMP + GPU
  - A three-level hybrid: MPI across nodes, OpenMP across CPU cores, and GPUs for offload from each CPU process or thread.
  - Common in codes incrementally ported to GPUs.
- Hybrid with task-based runtimes
  - MPI combined with tasking frameworks (e.g., OpenMP tasks, Kokkos, Charm++, HPX, or other runtimes).
  - More advanced and less common for beginners, but increasingly relevant.
Design Considerations for Hybrid Models
Designing a hybrid application requires architectural decisions that do not appear in single-model codes. Some key aspects:
Choosing MPI Process and Thread Counts
Typical strategies on a node with $N_{\text{cores}}$ cores:
- One MPI process per node
  - Use OpenMP threads across all cores on the node.
  - Pros: minimal number of MPI ranks, potentially simpler MPI patterns.
  - Cons: may suffer from NUMA effects and contention if not carefully tuned.
- One MPI process per socket
  - Each MPI process spawns OpenMP threads bound to the cores in that socket.
  - Often a good compromise between locality and simplicity.
- One MPI process per NUMA region
  - More fine-grained than per-socket if the node has multiple NUMA domains per socket.
  - Helps preserve memory locality.
- One MPI process per GPU
  - Common on GPU-based systems: each rank drives one GPU, optionally with threads per rank.
Selecting the right balance is a performance-tuning decision and can vary per machine and application.
Work Decomposition Across Levels
In a hybrid code, you must decide what kind of workload each level of parallelism handles:
- MPI level (coarse-grained, distributed memory)
  - Typically handles domain decomposition over large units (e.g., blocks of a grid, subsets of particles, matrix blocks).
  - Communicates halo regions, global reductions, and global data distributions.
- Thread level (shared memory)
  - Handles loop parallelization, local data processing, and fine-grained tasks inside each MPI subdomain.
  - Helps hide the latency of MPI communication via overlap (if designed carefully).
- GPU/accelerator level
  - Offloads compute-intensive kernels where high throughput is beneficial.
  - Often used inside each MPI rank as a “sub-accelerator” for local work.
Multi-level decomposition should minimize:
- Load imbalance between MPI ranks
- Load imbalance between threads or GPU kernels within a rank
- Redundant computation or communication
Programming Models for Hybrid CPU-Only Codes
Although many variations are possible, the most established hybrid CPU-only pattern is MPI + OpenMP.
Basic MPI + OpenMP Structure
At a high level, an MPI + OpenMP application looks like:
```c
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Initialize local data based on rank
    // ...

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        // Thread-local work here, typically indexed by tid
    }

    // MPI communication may happen between parallel regions,
    // or even inside them if MPI is used with thread support
    MPI_Finalize();
    return 0;
}
```

The hybrid aspects to pay special attention to include:
- MPI initialization for threading
  - Use MPI_Init_thread instead of MPI_Init when multiple threads might call MPI.
  - Request an appropriate thread support level (e.g., MPI_THREAD_FUNNELED or MPI_THREAD_MULTIPLE).
- Placement and binding
  - Use OpenMP environment variables and/or runtime APIs to control how threads are bound to cores.
  - Align MPI rank placement with NUMA regions/sockets when launching jobs with the scheduler.
Details of MPI and OpenMP themselves belong to other chapters; here, the focus is how they interact and are combined.
MPI Threading Levels
When combining MPI and threads, you must consider whether threads will call MPI routines:
- MPI_THREAD_SINGLE
  - Only one thread exists in the process (no threading).
- MPI_THREAD_FUNNELED
  - Multiple threads may exist, but only the main thread (typically the one that called MPI_Init_thread) may call MPI.
  - Common and relatively cheap to support.
- MPI_THREAD_SERIALIZED
  - Multiple threads may call MPI, but not at the same time; the application is responsible for serializing access.
  - More flexible, with slightly more overhead.
- MPI_THREAD_MULTIPLE
  - Multiple threads may call MPI routines concurrently.
  - Most flexible, but can have the highest overhead, depending on the implementation.
Typical hybrid designs try to use MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED to avoid the extra overhead of MPI_THREAD_MULTIPLE, unless there is a clear need for fully concurrent MPI calls from multiple threads.
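For example, a minimal sketch of requesting MPI_THREAD_FUNNELED and checking what the library actually provides could look like this (error handling kept deliberately simple):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int provided;

    // Request FUNNELED: only the thread that calls MPI_Init_thread
    // will make MPI calls later on.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    // The library may grant a lower level than requested, so check it.
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not provide MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    // ... hybrid MPI + OpenMP work ...

    MPI_Finalize();
    return 0;
}
```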
OpenMP Usage Patterns with MPI
Some common patterns of mixing MPI and OpenMP include:
- MPI outside, OpenMP inside
  - Outer loops over time steps or global stages use MPI for communication between subdomains.
  - Inner loops over local data are parallelized with OpenMP.
- MPI communication in the master thread only
  - #pragma omp parallel regions in which only one thread (e.g., via omp master or omp single) performs MPI calls while the others compute.
  - Suitable for MPI_THREAD_FUNNELED.
- Overlap of communication and computation (see the sketch after this list)
  - Use nonblocking MPI calls from one thread to initiate communication.
  - Other threads perform computation on independent data while communication is in progress.
  - Requires careful design to avoid race conditions and ensure data dependencies are respected.
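The overlap pattern can be sketched as follows. This is only an illustration: it assumes MPI was initialized with at least MPI_THREAD_FUNNELED, and compute_interior, compute_boundary, and the halo buffers are placeholders for the application's real data and kernels.

```c
#include <mpi.h>
#include <omp.h>

// Hypothetical helpers standing in for the real computation.
void compute_interior(double* u, int i);
void compute_boundary(double* u, const double* recv_halo, int i);

// Sketch: the master thread drives nonblocking halo exchange while the
// other threads compute on interior data that needs no halo values.
void timestep(double* u, double* send_halo, double* recv_halo,
              int halo_count, int n_interior, int n_boundary, int neighbor)
{
    MPI_Request reqs[2];   // shared by all threads in the parallel region

    #pragma omp parallel
    {
        #pragma omp master
        {
            MPI_Irecv(recv_halo, halo_count, MPI_DOUBLE, neighbor, 0,
                      MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(send_halo, halo_count, MPI_DOUBLE, neighbor, 0,
                      MPI_COMM_WORLD, &reqs[1]);
        }
        // 'master' has no implied barrier: workers start computing at once.

        #pragma omp for
        for (int i = 0; i < n_interior; ++i)
            compute_interior(u, i);            // halo-independent work

        #pragma omp master
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        #pragma omp barrier                    // halos now valid for all threads

        #pragma omp for
        for (int i = 0; i < n_boundary; ++i)
            compute_boundary(u, recv_halo, i); // needs received halo data
    }
}
```

Whether communication actually progresses during the interior loop depends on the MPI implementation's asynchronous progress, so measuring the achieved overlap is essential.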
Programming Models for Hybrid CPU–GPU Codes
Hybrid CPU–GPU codes often combine MPI with a GPU programming model:
- MPI + CUDA
- MPI + OpenACC
- MPI + vendor-specific GPU frameworks (e.g., HIP, SYCL in some setups)
Rank-to-GPU Mapping
A crucial design decision is how MPI ranks map to GPUs:
- One MPI rank per GPU (most common)
  - Each rank calls GPU kernels on its own GPU(s).
  - Threading is either minimal or used for CPU-side tasks (e.g., OpenMP for host computations).
- Multiple ranks per GPU
  - Less common; can cause contention and complexity.
  - Sometimes used to match a legacy MPI-only code structure before porting pieces to the GPU.
- One rank controlling multiple GPUs
  - Used on some multi-GPU nodes when a single process orchestrates multiple devices.
  - More complex to program and load-balance.
The mapping choice has implications for:
- MPI communicator layouts (e.g., communicators per node, per GPU)
- Load balancing and domain decomposition
- Use of GPU-direct or GPU-aware MPI features
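A common way to realize the one-rank-per-GPU mapping is to derive a node-local rank from a shared-memory communicator and use it to pick a device. Below is a minimal sketch assuming CUDA; error checking is omitted and the round-robin assignment is just one possible policy.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Communicator containing only the ranks on this node.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    // Bind this rank to one of the node's GPUs
    // (round-robin if there are more ranks than devices).
    int n_devices;
    cudaGetDeviceCount(&n_devices);
    cudaSetDevice(local_rank % n_devices);

    // ... allocate device memory, launch kernels, exchange halos ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```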
GPU-Aware MPI
On GPU clusters, many MPI implementations support:
- Direct sending and receiving of device memory buffers (GPU-aware MPI)
- Peer-to-peer device transfers
- NVLink or similar high-bandwidth interconnects
Hybrid designs can:
- Use MPI to exchange data directly between GPUs across nodes.
- Overlap GPU kernels with MPI communication, often via CUDA streams and nonblocking MPI.
Exact APIs and performance details depend on the specific hardware and MPI implementation, but the hybrid concept is:
- Decompose work across ranks (MPI).
- Within each rank, keep data resident on the GPU as much as possible.
- Minimize transfers between host and device and between GPUs.
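Assuming the MPI library is GPU-aware, device buffers can be handed directly to MPI calls. The sketch below shows a halo exchange on device memory; buffer names, neighbor ranks, and tags are illustrative, and with a non-GPU-aware MPI the data would instead have to be staged through host buffers.

```c
#include <mpi.h>

// Sketch: halo exchange directly on device buffers, assuming a
// GPU-aware MPI implementation. The d_* pointers are device memory
// (e.g., from cudaMalloc); 'left' and 'right' are neighbor ranks.
void exchange_halos(double* d_send_left,  double* d_recv_left,
                    double* d_send_right, double* d_recv_right,
                    int halo_count, int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[4];

    MPI_Irecv(d_recv_left,  halo_count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(d_recv_right, halo_count, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(d_send_left,  halo_count, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(d_send_right, halo_count, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}
```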
Node-Level Parallelism in Hybrid Codes
Node-level parallelism concerns how you exploit all the resources on a single compute node using threads and/or accelerators.
CPU Node-Level Parallelism
On CPU-only nodes, the primary concerns are:
- Core utilization
  - Assign threads so that each hardware core is used effectively.
  - Avoid oversubscription (more runnable threads than cores without good reason).
- NUMA awareness
  - Ensure threads mostly access memory local to their NUMA region.
  - Align data placement with the NUMA locality of the MPI rank or thread group.
- Synchronization strategy
  - Use OpenMP constructs that match your workload (e.g., parallel for, sections, tasks).
  - Minimize synchronization overheads inside hot regions.
Hybrid codes often:
- Create one MPI rank per socket or NUMA region.
- Spawn an OpenMP team restricted to that locality.
- Use first-touch memory allocation strategies to lay out data correctly in memory.
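First-touch placement means a memory page is physically allocated on the NUMA node of the thread that first writes to it. A minimal sketch, assuming threads are pinned (e.g., via OMP_PROC_BIND) and the same static schedule is used later in the compute loops:

```c
#include <stdlib.h>
#include <omp.h>

// Allocate and initialize an array so that each page is first touched
// by the thread that will later operate on it (first-touch placement).
double* alloc_first_touch(size_t n)
{
    double* a = malloc(n * sizeof *a);   // pages not yet physically placed

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        a[i] = 0.0;                      // first write places the page locally

    return a;
}
```

If the initialization loop and the later compute loops use different schedules or thread counts, pages can still end up on the wrong NUMA node despite first-touch.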
GPU Node-Level Parallelism
On GPU nodes, node-level parallelism includes:
- Multiple GPUs per node
  - Multiple MPI ranks, each bound to a distinct GPU, or a single rank controlling multiple GPUs.
  - Use CUDA streams or similar mechanisms to overlap transfers and compute.
- Interaction between CPUs and GPUs
  - CPUs handle MPI communication and host-side logic.
  - GPUs handle compute-intensive kernels.
  - Combine CPU parallelism (threads) and GPU parallelism (kernels and blocks) carefully to avoid idle periods.
For beginners, a typical stepwise path is:
- Start with an MPI-only code.
- Introduce GPU offload per MPI rank.
- Optionally introduce CPU threading on top if beneficial.
Cluster-Level Parallelism in Hybrid Codes
Cluster-level parallelism is primarily managed via distributed-memory mechanisms (e.g., MPI). In a hybrid code, this level should:
- Align with node architecture
  - Use node-level topology (sockets, GPUs) to guide rank distribution across the cluster.
  - Request appropriate resources in job scripts (e.g., --nodes, --ntasks-per-node, --gpus-per-node in SLURM).
- Use communicators for subgroups
  - Create MPI communicators that group ranks by node, by socket, by GPU, etc.
  - Facilitate node-local operations (e.g., shared memory windows, reduction operations) separate from inter-node ones (see the sketch after this list).
- Balance the global workload
  - Ensure each node’s total workload (across all its ranks and threads/GPUs) is approximately equal.
  - Avoid hot spots where some nodes are overloaded while others are underutilized.
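As an illustration of node-aware communicators, the sketch below builds a node-local communicator plus a communicator containing one leader rank per node, then performs a two-stage sum (node-local first, then across nodes). The structure and the names node_comm and leader_comm are illustrative, not a prescribed API.

```c
#include <mpi.h>

// Two-stage reduction: sum within each node, then across node leaders.
double hierarchical_sum(double local_value)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;

    // Ranks sharing a node.
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    // One communicator containing rank 0 of every node (the "leaders");
    // the remaining ranks land in a communicator they simply do not use here.
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : 1, 0, &leader_comm);

    double node_sum = 0.0, global_sum = 0.0;
    MPI_Reduce(&local_value, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    if (node_rank == 0)
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, leader_comm);

    // Broadcast the result back to every rank on the node.
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);

    MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    return global_sum;
}
```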
Hybrid designs let you craft different strategies for:
- Inter-node communication patterns (e.g., halo exchange only between neighbors).
- Intra-node work/failure domains (e.g., isolating failures or stragglers).
Common Hybrid Programming Patterns
Several recurring design patterns appear across many hybrid applications. Recognizing them helps both in reading existing codes and designing new ones.
Pattern 1: MPI Domains + OpenMP Loop Parallelism
Idea: Each MPI rank owns a large subdomain; within that subdomain, OpenMP parallelizes inner loops.
- Typical in finite-difference or finite-volume PDE codes.
- MPI handles halo exchange at the boundaries of subdomains.
- OpenMP uses parallel for directives over spatial or temporal loops.
Characteristics:
- Easy to retrofit into existing MPI codes.
- Works well if each subdomain is large enough to keep threads busy.
- Thread synchronization overhead is minimal when loops are long and regular.
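A rough sketch of this pattern for a 1D decomposition follows; the update function, the single ghost cell per side, and the neighbor ranks left/right are illustrative assumptions rather than a specific production code.

```c
#include <mpi.h>
#include <omp.h>

// Pattern 1 sketch: MPI halo exchange for a 1D subdomain, followed by an
// OpenMP-parallel Jacobi-style update. u and u_new have n_local + 2 entries:
// u[0] and u[n_local + 1] are ghost cells filled from the neighbors.
void update(double* u, double* u_new, int n_local, int left, int right)
{
    MPI_Request reqs[4];

    // Exchange boundary values with the two neighbors
    // (MPI_PROC_NULL can be used at physical boundaries).
    MPI_Irecv(&u[0],           1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&u[n_local + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&u[1],           1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&u[n_local],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    // OpenMP parallelizes the local loop inside the MPI subdomain.
    #pragma omp parallel for schedule(static)
    for (int i = 1; i <= n_local; ++i)
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);
}
```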
Pattern 2: MPI with Thread-Parallel Tasks
Idea: MPI for domain decomposition; OpenMP or another threading library to handle irregular or task-based parallelism within each rank.
- Suitable for adaptive mesh refinement (AMR), tree codes, or irregular graphs.
- Use OpenMP tasks or similar to distribute work over uneven data structures.
Characteristics:
- More complex to debug and tune.
- Can yield better load balance inside each rank than simple loop parallelism.
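A hedged sketch of the idea within one rank, using OpenMP tasks over a hypothetical binary tree (the node type and process_leaf are placeholders for the application's real data structure and work):

```c
#include <omp.h>

// Hypothetical tree node for an irregular data structure.
typedef struct node {
    struct node* left;
    struct node* right;
    /* payload ... */
} node;

void process_leaf(node* n);   // placeholder for the real per-leaf work

// Each subtree becomes an OpenMP task, so uneven trees are balanced
// dynamically across the threads of this MPI rank.
void traverse(node* n)
{
    if (n == 0) return;
    if (n->left == 0 && n->right == 0) {
        process_leaf(n);
        return;
    }
    #pragma omp task
    traverse(n->left);
    #pragma omp task
    traverse(n->right);
    #pragma omp taskwait
}

void traverse_tree(node* root)
{
    #pragma omp parallel
    {
        #pragma omp single      // one thread seeds the task tree
        traverse(root);
    }
}
```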
Pattern 3: MPI Rank per GPU + GPU Kernels
Idea: One MPI rank per GPU, each managing a subset of the global data, offloaded to the device.
- MPI used for halo exchanges and global reductions.
- Most computational kernels run on the GPU; the CPU runs MPI and host control.
- Sometimes combined with OpenMP on the host to overlap tasks.
Characteristics:
- Common in modern GPU-accelerated simulations.
- Data movement between host and device must be carefully managed.
- Often uses GPU-aware MPI to minimize overhead.
Pattern 4: Master–Worker Within a Node
Idea: A node-level “master” thread performs MPI communication and management, while worker threads handle computation.
- Aligns with MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED.
- Can be implemented via OpenMP’s single construct or via explicit threading.
Characteristics:
- Simplifies MPI usage by funneling it through a single control point.
- Facilitates communication–computation overlap if workers continue compute while the master manages messages.
Pattern 5: Hierarchical Decomposition
Idea: Parallelism is split over multiple hierarchical levels:
- MPI over nodes
- OpenMP across sockets within a node
- Vectorization inside loops
- Optional GPU-level parallelism
Characteristics:
- Matches hardware hierarchy closely.
- Potentially complex, but can offer strong scalability and efficiency.
Challenges and Pitfalls in Hybrid Programming
Hybrid approaches bring additional complexity beyond single-model parallelism. Some typical issues:
Increased Complexity and Maintenance Cost
- More APIs to understand and use correctly.
- Harder debugging: errors can arise from interactions between MPI and threads or GPUs.
- Longer development and testing time.
For many applications, a simple model (e.g., pure MPI or MPI + GPU) might be sufficient, and hybrid complexity must be justified by real performance gains.
Load Imbalance Across Levels
- Even if MPI ranks are balanced, threads inside a rank might not be.
- GPU kernels must also be balanced to avoid idle devices.
- Amdahl’s Law still applies: un-parallelized or poorly balanced parts can dominate runtime.
Hybrid codes require performance analysis at multiple levels:
- Between nodes
- Within nodes
- Within devices
NUMA and Memory Locality Issues
Hybrid CPU codes can suffer from:
- Threads accessing memory on remote NUMA nodes.
- Inappropriate binding of MPI ranks and threads to cores.
Symptoms include:
- Good scaling up to a few cores, then flattening or slowdowns.
- Performance very sensitive to OMP_PROC_BIND, numactl, or job scheduler binding options.
Thread-Safety and Race Conditions with MPI
When multiple threads may interact with MPI:
- Misuse of MPI_THREAD_MULTIPLE or incorrect assumptions about MPI thread safety can lead to race conditions.
- Some MPI libraries may have limited or costly support for high thread levels.
Practical advice:
- Restrict MPI calls to one thread when possible (FUNNELED pattern).
- If multiple threads must call MPI, ensure rigorous synchronization.
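Under MPI_THREAD_SERIALIZED, one simple (if blunt) way to honor the “not at the same time” rule inside a parallel region is a named critical section around every MPI call. The helper below is only a sketch, and the chunked send pattern is hypothetical:

```c
#include <mpi.h>
#include <omp.h>

// Sketch: serialize MPI calls made from inside a parallel region.
// Suitable for MPI_THREAD_SERIALIZED, where threads may call MPI,
// but never concurrently.
void threaded_send(const double* chunks, int chunk_len,
                   int n_chunks, int dest)
{
    #pragma omp parallel for
    for (int c = 0; c < n_chunks; ++c) {
        // ... per-thread computation on chunk c ...

        #pragma omp critical(mpi_calls)   // one thread in MPI at a time
        MPI_Send(&chunks[c * chunk_len], chunk_len, MPI_DOUBLE,
                 dest, c, MPI_COMM_WORLD);
    }
}
```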
Debugging and Profiling Complexity
- Tools may handle MPI and threads separately; combining them can require specialized MPI+OpenMP/GPU profilers.
- Timestamps and event ordering across ranks and threads can become difficult to interpret.
A structured approach to performance analysis—profiling at each level separately, then combined—is essential.
When (and When Not) to Use Hybrid Programming
Hybrid parallel programming is not always the right choice. Situations where it makes sense include:
- Memory constraints
  - Pure MPI would require too many ranks per node, exceeding memory or hitting OS limits (e.g., file handles, MPI buffers).
- Scaling limits with pure MPI
  - Adding more MPI ranks no longer improves performance, but many cores are idle.
  - Threading within each rank can help exploit additional cores.
- Need for GPU acceleration
  - Modern leadership systems use GPUs or other accelerators; hybrid models are often required to use them effectively.
- Clear hierarchical structure in the algorithm
  - Natural decomposition into coarse tasks for MPI and fine-grained loops or tasks for OpenMP or GPU kernels.
Hybrid programming may not be necessary if:
- The problem size is small; single-node or single-model parallelism is enough.
- The code is simple and must remain very easy to maintain.
- Effective performance can be achieved with a simpler pattern (e.g., pure MPI+GPU with no CPU threading).
Practical Getting-Started Strategy
For absolute beginners, a practical roadmap to hybrid codes might be:
- Start with a correct, reasonably efficient serial code.
- Add MPI for domain decomposition across multiple nodes.
- Introduce OpenMP inside each MPI rank to parallelize the most expensive loops or tasks.
- Measure performance at each step to confirm actual benefits.
- (Optional) Add GPU support:
  - Offload critical kernels while keeping the MPI+OpenMP structure on the host.
  - Gradually move more computation to the GPU.
- Iterate on placement and scaling:
  - Experiment with different rank/thread configurations, affinities, and resource requests in job scripts.
  - Evaluate strong and weak scaling behavior.
Focusing on a small set of clear hybrid patterns and gradually refining them is typically more productive than trying to apply every available technique at once.