Why Hybrid Programming Exists
Hybrid programming means using more than one parallel model at the same time, most commonly:
- MPI between nodes (distributed memory)
- OpenMP or threads within a node (shared memory)
- And possibly GPUs/accelerators as a further layer
The motivation for hybrid programming comes from how modern HPC systems are built and from the limitations of using only a single model such as “pure MPI” or “only threads”.
This chapter focuses on why you might want hybrid programming, not how to write it.
Modern Hardware Drives Hybrid Designs
Multi-core, many-core, and NUMA nodes
Typical HPC nodes today:
- Contain many CPU cores (dozens or more)
- Often have multiple CPU sockets per node
- Have non-uniform memory access (NUMA): different sockets have “closer” memory
- May include one or more GPUs or other accelerators
If you used only MPI:
- You could run one MPI process per core
- You would ignore the fact that cores share memory and cache
- You might pay high MPI overhead inside a node
If you used only shared-memory threads:
- You would be confined to a single node (or a small shared-memory system)
- Scaling to thousands of cores across nodes would be hard or impossible
Hybrid programming aligns with the hardware:
- MPI handles communication across nodes
- Threads (and possibly GPUs) exploit the shared-memory and many-core nature inside each node
Increasing core counts per node
Core counts per node are increasing faster than node counts in many systems. That means:
- Communication between nodes is relatively costly
- Communication within a node (via shared memory) is comparatively cheap
Hybrid models aim to:
- Minimize inter-node communication (MPI)
- Maximize intra-node parallelism using threads or accelerators
Limitations of Pure MPI
Using only MPI for everything is simple conceptually, but it has drawbacks on modern clusters.
Memory footprint per process
Each MPI process typically has its own:
- Copy of large data structures (e.g. lookup tables, meshes, matrices)
- MPI-related buffers and metadata
On a node with many cores, “one MPI process per core” can lead to:
- Large total memory per node
- Hitting memory limits sooner
- Reduced cache effectiveness because data is duplicated
Hybrid motivation:
- Use fewer MPI processes per node (e.g. one per NUMA domain or one per socket)
- Use threads within each process to use all cores
- Share large data structures between threads rather than duplicating them per process
Result: lower memory use and potentially better cache behavior.
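A minimal sketch of this idea, assuming a C compiler with MPI and OpenMP support (the table size and contents are purely illustrative): one MPI rank per socket or NUMA domain allocates a single large table, and all of that rank's OpenMP threads work on the same copy instead of every core-sized process holding its own duplicate.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* MPI_THREAD_FUNNELED: only the master thread will call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One large lookup table per rank (e.g., one rank per socket),
     * not one per core. */
    const size_t n = 100UL * 1000 * 1000;        /* ~800 MB, illustrative */
    double *table = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; i++)
        table[i] = (double)i;

    /* All threads of this rank read and update the same shared copy. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        table[i] *= 2.0;

    printf("rank %d: %d threads share one %zu-entry table\n",
           rank, omp_get_max_threads(), n);

    free(table);
    MPI_Finalize();
    return 0;
}
```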
MPI communication overhead and scaling limits
With pure MPI:
- The number of MPI processes equals the total number of cores used
- Global operations (collectives) involve all processes
- The number of communication partners and messages can become very large
As process count grows:
- Latency and overhead in collectives (like MPI_Allreduce) can dominate
- Point-to-point messages become more numerous
- MPI bookkeeping and progress costs increase
Hybrid motivation:
- Reduce the number of MPI ranks by using threads inside each rank
- Fewer ranks → fewer endpoints in collective operations (see the reduction sketch after this list)
- Less communication overhead per node
- Potentially better strong and weak scaling at large core counts
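A minimal sketch of the effect on collectives (array sizes and contents are placeholders): each rank first reduces its local data with an OpenMP reduction in shared memory, and only then joins a single MPI_Allreduce whose endpoint count is the number of ranks rather than the number of cores.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1000000;
    static double data[1000000];
    for (int i = 0; i < n; i++)
        data[i] = 1.0;

    /* Intra-node step: threads reduce through shared memory, no messages. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += data[i];

    /* Inter-node step: one collective over nranks endpoints, where nranks
     * counts ranks, not cores. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g over %d ranks\n", global, nranks);

    MPI_Finalize();
    return 0;
}
```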
Load balancing and domain decomposition flexibility
In pure MPI:
- Problem domain is decomposed into many small pieces, one per process
- Very fine domain decomposition can cause:
- High surface-to-volume ratio (more communication relative to computation)
- Rigid domains that are hard to balance across many processes
With hybrid:
- Decompose into fewer, larger MPI subdomains
- Use threads within each subdomain to distribute work over cores
- Sometimes easier to:
- Adjust workload within a node/thread team
- Use dynamic scheduling in OpenMP for irregular work (sketched after this list)
- Reduce communication surfaces between MPI subdomains
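For the irregular-work case, here is a sketch of OpenMP dynamic scheduling inside one rank's subdomain; the per-cell kernel below is a made-up placeholder whose cost varies strongly with the cell index.

```c
#include <stdio.h>

/* Placeholder kernel: per-cell cost varies strongly with the index. */
static double process_cell(int c)
{
    double x = 0.0;
    for (int i = 0; i < (c % 1000) * 100; i++)
        x += 1e-6 * i;
    return x;
}

int main(void)
{
    const int ncells = 100000;   /* cells of one MPI rank's subdomain */
    double total = 0.0;

    /* Chunks of 16 cells are handed out as threads become free, so a few
     * expensive cells do not stall an otherwise static partition. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int c = 0; c < ncells; c++)
        total += process_cell(c);

    printf("subdomain total = %g\n", total);
    return 0;
}
```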
Limitations of Pure Shared-Memory Programming
Using only OpenMP or threads also has drawbacks.
Node-limited scaling
Shared-memory programming alone:
- Works well on a single node or small SMP machine
- Does not natively support scaling across many nodes
Hybrid motivation:
- Use MPI for inter-node distribution
- Use threads for intra-node work sharing
- Allow the program to scale from:
- Laptop or workstation (just threads)
- Up to large clusters (MPI + threads)
Expressing multi-level parallelism
Many problems have a natural hierarchy:
- Top level: large subdomains, ensembles, or parameter sweeps
- Inner level: loops, matrix operations, kernels
Pure shared-memory:
- Can express inner-level parallelism but does not directly structure inter-node work
Hybrid motivation:
- Use MPI to distribute coarse-grained tasks/subdomains across nodes
- Use threads to accelerate fine-grained computational kernels within each MPI rank
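A sketch of this two-level structure (the kernel and the case count are invented for illustration): MPI ranks divide an ensemble of cases round-robin, and OpenMP threads parallelize the computation inside each case.

```c
#include <mpi.h>
#include <stdio.h>

/* Placeholder fine-grained kernel for one case/parameter set. */
static double run_case(int p)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < 1000000; i++)
        s += 1e-6 * (p + 1);
    return s;
}

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Coarse level: distribute cases across ranks (and thus nodes). */
    const int ncases = 64;
    for (int p = rank; p < ncases; p += nranks)
        printf("rank %d: case %d -> %g\n", rank, p, run_case(p));

    MPI_Finalize();
    return 0;
}
```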
Architectural and Performance Motivations
Matching the memory hierarchy
Modern nodes have:
- Multiple cache levels (L1, L2, L3)
- NUMA regions with different access latencies
- Possibly high-bandwidth memory (HBM) on some CPUs and GPUs
Hybrid programming can:
- Map MPI ranks to NUMA domains or sockets
- Use threads within each rank to keep data local in caches and local memory (e.g., via first-touch initialization, sketched below)
- Reduce remote memory accesses across NUMA domains
Motivation:
- Better locality
- Lower memory latency
- More effective use of caches and memory bandwidth
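One common technique for the locality point, sketched below under the assumption that threads are pinned (e.g., OMP_PROC_BIND=close and OMP_PLACES=cores) and that the operating system uses first-touch page placement, is to initialize arrays with the same threads and schedule that will later compute on them, so each page lands in the NUMA memory nearest to its thread.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 50UL * 1000 * 1000;
    double *a = malloc(n * sizeof(double));

    /* First touch in parallel: each thread touches the pages it will own. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;

    /* Later compute loops with the same static schedule then work mostly
     * on NUMA-local pages. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] += 1.0;

    printf("a[0] = %g\n", a[0]);
    free(a);
    return 0;
}
```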
Reducing communication and synchronization costs
Some operations are cheaper with threads than with MPI:
- Thread synchronization is often cheaper than inter-process communication
- Data sharing via memory is faster than sending messages for intra-node data exchange
Hybrid motivation:
- Use shared-memory constructs (e.g., OpenMP reductions) within nodes
- Use MPI only when communication must cross node boundaries
- Replace many small intra-node MPI messages with simple memory accesses and thread barriers
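The last point can be sketched with a small shared-memory stencil (sizes and values are illustrative): threads read their neighbours' values directly from a shared array, and the implicit barriers at the end of the parallel loops take the place of what would otherwise be intra-node halo messages between MPI processes.

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    static double u[N], u_new[N];
    for (int i = 0; i < N; i++)
        u[i] = (double)i;

    for (int step = 0; step < 10; step++) {
        /* Neighbour values are plain memory reads, not messages. */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < N - 1; i++)
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

        /* The implicit barrier above plus this copy phase replace
         * intra-node message exchange and its bookkeeping. */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < N - 1; i++)
            u[i] = u_new[i];
    }

    printf("u[N/2] = %g\n", u[N / 2]);
    return 0;
}
```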
Exploiting accelerators together with CPUs
Systems increasingly combine:
- Multi-core CPUs
- Multiple GPUs or other accelerators per node
A common pattern:
- MPI between nodes, and sometimes between GPUs (e.g., one rank per GPU)
- OpenMP or other threading on the CPU side
- CUDA/OpenACC/OpenMP offload for GPU kernels
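A minimal sketch of this pattern, assuming an MPI library plus a compiler with OpenMP target-offload support (without a GPU the target region simply runs on the host):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    static double x[1 << 20];
    for (int i = 0; i < n; i++)
        x[i] = 1.0;

    double local = 0.0;
    /* Offloaded kernel: runs on the GPU if a device is available,
     * otherwise on the host. */
    #pragma omp target teams distribute parallel for \
            reduction(+:local) map(to: x[0:n]) map(tofrom: local)
    for (int i = 0; i < n; i++)
        local += x[i];

    /* Cross-node combination of the per-rank (per-GPU) results. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global);

    MPI_Finalize();
    return 0;
}
```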
Motivation:
- Utilize all resources:
- CPU cores (thread-level parallelism)
- GPUs (massive data parallelism)
- Multiple nodes (MPI-scale parallelism)
- Coordinate CPU–GPU work at node level while MPI exchanges data between nodes
Practical Motivations in Real Applications
Handling very large problems
For very large simulations or datasets:
- Memory per core is often the limiting factor
- I/O and communication overhead can dominate compute time
Hybrid helps by:
- Lowering per-rank memory needs as discussed earlier
- Organizing I/O at rank level (e.g., one rank per node doing I/O on behalf of its threads; see the sketch after this list)
- Allowing different levels of parallelism for:
- Computation
- I/O
- Pre/post-processing
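The I/O point can be sketched with MPI_Comm_split_type, which groups the ranks that share a node so that only one rank per node writes; the file name and data below are placeholders.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, node_rank;
    MPI_Comm node_comm;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All ranks on the same node end up in the same node_comm. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    double result = 42.0;                       /* placeholder data */
    double node_result = 0.0;
    MPI_Reduce(&result, &node_result, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    if (node_rank == 0) {
        /* One writer per node instead of one per core. */
        char name[64];
        snprintf(name, sizeof name, "out_rank%d.dat", rank);
        FILE *f = fopen(name, "w");
        if (f) { fprintf(f, "%g\n", node_result); fclose(f); }
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```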
Simplifying some aspects of code design
A well-chosen hybrid decomposition can:
- Make communication patterns clearer at the MPI level (fewer, larger subdomains)
- Concentrate parallel loop handling inside node-level kernels
- Allow reuse of optimized shared-memory kernels (e.g., OpenMP-parallel library calls) inside an MPI code
Motivation:
- Separate concerns:
- MPI layer handles global distribution and communication
- Thread layer focuses on local performance and loops
- Enable incremental porting:
- Start from an MPI code and add OpenMP in hot kernels
- Or start from a threaded code and wrap it with MPI to go multi-node
Adapting to different systems and batch environments
Clusters differ:
- Some have many cores per node, others fewer
- Some have multiple GPUs per node, others none
- Memory per core varies
Hybrid designs can be more performance-portable:
- You can tune:
- Number of MPI ranks per node
- Number of threads per rank
- Same code can run efficiently on:
- A small-node-count system with many cores per node
- A large-node-count system with fewer cores per node
Job schedulers also often encourage:
- Efficient use of nodes (e.g., using all cores per node)
- Reduced network traffic
Hybrid programming helps you honor these constraints while maintaining performance.
Trade-offs and When Hybrid Makes Sense
Hybrid programming is not a free win; it adds complexity:
- Two (or more) programming models to understand and debug
- Potential for new kinds of bugs (e.g., mismatched MPI and threading assumptions)
- Additional tuning parameters (ranks vs. threads vs. GPU work size)
Despite that, the motivation to use hybrid is strong when:
- Nodes have many cores and/or GPUs
- Memory per process is a concern
- Pure MPI scaling starts to saturate at high process counts
- You want a single code to run efficiently across a range of machine sizes
Conversely, hybrid may be less compelling when:
- Running on small systems with few cores per node
- Problem size is modest
- Simplicity is more important than peak performance
Understanding these motivations will help you decide, in later chapters, how to combine MPI, OpenMP, and accelerators in a way that matches your applications and target systems.