Why Hybrid Programming Exists
Hybrid programming means using more than one parallel model at the same time, most commonly:
- MPI between nodes (distributed memory)
- OpenMP or threads within a node (shared memory)
- And possibly GPUs/accelerators as a further layer
The motivation for hybrid programming comes from how modern HPC systems are built and from the limitations of using only a single model such as “pure MPI” or “only threads”.
This chapter focuses on why you might want hybrid programming, not how to write it.
Modern Hardware Drives Hybrid Designs
Multi-core, many-core, and NUMA nodes
Typical HPC nodes today:
- Contain many CPU cores (dozens or more)
- Often have multiple CPU sockets per node
- Have non-uniform memory access (NUMA): different sockets have “closer” memory
- May include one or more GPUs or other accelerators
If you used only MPI:
- You could run one MPI process per core
- You would ignore the fact that cores share memory and cache
- You might pay high MPI overhead inside a node
If you used only shared-memory threads:
- You would be confined to a single node (or a small shared-memory system)
- Scaling to thousands of cores across nodes would be hard or impossible
Hybrid programming aligns with the hardware:
- MPI handles communication across nodes
- Threads (and possibly GPUs) exploit the shared-memory and many-core nature inside each node
Increasing core counts per node
Core counts per node are increasing faster than node counts in many systems. That means:
- Communication between nodes is relatively costly
- Communication within a node (via shared memory) is comparatively cheap
Hybrid models aim to:
- Minimize inter-node communication (MPI)
- Maximize intra-node parallelism using threads or accelerators
Limitations of Pure MPI
Using only MPI for everything is simple conceptually, but it has drawbacks on modern clusters.
Memory footprint per process
Each MPI process typically has its own:
- Copy of large data structures (e.g. lookup tables, meshes, matrices)
- MPI-related buffers and metadata
On a node with many cores, “one MPI process per core” can lead to:
- Large total memory per node
- Hitting memory limits sooner
- Reduced cache effectiveness because data is duplicated
Hybrid motivation:
- Use fewer MPI processes per node (e.g. one per NUMA domain or one per socket)
- Use threads within each process to use all cores
- Share large data structures between threads rather than duplicating them per process
Result: lower memory use and potentially better cache behavior.
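A minimal sketch of this idea, assuming a C compiler with MPI and OpenMP support (the table size and contents are purely illustrative): one MPI rank per socket or NUMA domain allocates a single large table, and all of that rank's OpenMP threads work on the same copy instead of every core-sized process holding its own duplicate.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* MPI_THREAD_FUNNELED: only the master thread will call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One large lookup table per rank (e.g., one rank per socket),
     * not one per core. */
    const size_t n = 100UL * 1000 * 1000;        /* ~800 MB, illustrative */
    double *table = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; i++)
        table[i] = (double)i;

    /* All threads of this rank read and update the same shared copy. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        table[i] *= 2.0;

    printf("rank %d: %d threads share one %zu-entry table\n",
           rank, omp_get_max_threads(), n);

    free(table);
    MPI_Finalize();
    return 0;
}
```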
MPI communication overhead and scaling limits
With pure MPI:
- The number of MPI processes equals the total number of cores used
- Global operations (collectives) involve all processes
- The number of communication partners and messages can become very large
As process count grows:
- Latency and overhead in collectives (like MPI_Allreduce) can dominate
- Point-to-point messages become more numerous
- MPI bookkeeping and progress costs increase
Hybrid motivation:
- Reduce the number of MPI ranks by using threads inside each rank
- Fewer ranks → fewer endpoints in collective operations (see the reduction sketch after this list)
- Less communication overhead per node
- Potentially better strong and weak scaling at large core counts
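A minimal sketch of the effect on collectives (array sizes and contents are placeholders): each rank first reduces its local data with an OpenMP reduction in shared memory, and only then joins a single MPI_Allreduce whose endpoint count is the number of ranks rather than the number of cores.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1000000;
    static double data[1000000];
    for (int i = 0; i < n; i++)
        data[i] = 1.0;

    /* Intra-node step: threads reduce through shared memory, no messages. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += data[i];

    /* Inter-node step: one collective over nranks endpoints, where nranks
     * counts ranks, not cores. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g over %d ranks\n", global, nranks);

    MPI_Finalize();
    return 0;
}
```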
Load balancing and domain decomposition flexibility
In pure MPI:
- Problem domain is decomposed into many small pieces, one per process
- Very fine domain decomposition can cause:
- High surface-to-volume ratio (more communication relative to computation)
- Rigid domains that are hard to balance across many processes
With hybrid:
- Decompose into fewer, larger MPI subdomains
- Use threads within each subdomain to distribute work over cores
- Sometimes easier to:
- Adjust workload within a node/thread team
- Use dynamic scheduling in OpenMP for irregular work (sketched after this list)
- Reduce communication surfaces between MPI subdomains
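For the irregular-work case, here is a sketch of OpenMP dynamic scheduling inside one rank's subdomain; the per-cell kernel below is a made-up placeholder whose cost varies strongly with the cell index.

```c
#include <stdio.h>

/* Placeholder kernel: per-cell cost varies strongly with the index. */
static double process_cell(int c)
{
    double x = 0.0;
    for (int i = 0; i < (c % 1000) * 100; i++)
        x += 1e-6 * i;
    return x;
}

int main(void)
{
    const int ncells = 100000;   /* cells of one MPI rank's subdomain */
    double total = 0.0;

    /* Chunks of 16 cells are handed out as threads become free, so a few
     * expensive cells do not stall an otherwise static partition. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int c = 0; c < ncells; c++)
        total += process_cell(c);

    printf("subdomain total = %g\n", total);
    return 0;
}
```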
Limitations of Pure Shared-Memory Programming
Using only OpenMP or threads also has drawbacks.
Node-limited scaling
Shared-memory programming alone:
- Works well on a single node or small SMP machine
- Does not natively support scaling across many nodes
Hybrid motivation:
- Use MPI for inter-node distribution
- Use threads for intra-node work sharing
- Allow the program to scale from:
- Laptop or workstation (just threads)
- Up to large clusters (MPI + threads)
Expressing multi-level parallelism
Many problems have a natural hierarchy:
- Top level: large subdomains, ensembles, or parameter sweeps
- Inner level: loops, matrix operations, kernels
Pure shared-memory:
- Can express inner-level parallelism but does not directly structure inter-node work
Hybrid motivation:
- Use MPI to distribute coarse-grained tasks/subdomains across nodes
- Use threads to accelerate fine-grained computational kernels within each MPI rank
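A sketch of this two-level structure (the kernel and the case count are invented for illustration): MPI ranks divide an ensemble of cases round-robin, and OpenMP threads parallelize the computation inside each case.

```c
#include <mpi.h>
#include <stdio.h>

/* Placeholder fine-grained kernel for one case/parameter set. */
static double run_case(int p)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < 1000000; i++)
        s += 1e-6 * (p + 1);
    return s;
}

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Coarse level: distribute cases across ranks (and thus nodes). */
    const int ncases = 64;
    for (int p = rank; p < ncases; p += nranks)
        printf("rank %d: case %d -> %g\n", rank, p, run_case(p));

    MPI_Finalize();
    return 0;
}
```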
Architectural and Performance Motivations
Matching the memory hierarchy
Modern nodes have:
- Multiple cache levels (L1, L2, L3)
- NUMA regions with different access latencies
- Possibly high-bandwidth memory (HBM) on some CPUs and GPUs
Hybrid programming can:
- Map MPI ranks to NUMA domains or sockets
- Use threads within each rank to keep data local in caches and local memory (e.g., via first-touch initialization, sketched below)
- Reduce remote memory accesses across NUMA domains
Motivation:
- Better locality
- Lower memory latency
- More effective use of caches and memory bandwidth
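One common technique for the locality point, sketched below under the assumption that threads are pinned (e.g., OMP_PROC_BIND=close and OMP_PLACES=cores) and that the operating system uses first-touch page placement, is to initialize arrays with the same threads and schedule that will later compute on them, so each page lands in the NUMA memory nearest to its thread.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 50UL * 1000 * 1000;
    double *a = malloc(n * sizeof(double));

    /* First touch in parallel: each thread touches the pages it will own. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;

    /* Later compute loops with the same static schedule then work mostly
     * on NUMA-local pages. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] += 1.0;

    printf("a[0] = %g\n", a[0]);
    free(a);
    return 0;
}
```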
Reducing communication and synchronization costs
Some operations are cheaper with threads than with MPI:
- Thread synchronization is often cheaper than inter-process communication
- Data sharing via memory is faster than sending messages for intra-node data exchange
Hybrid motivation:
- Use shared-memory constructs (e.g., OpenMP reductions) within nodes
- Use MPI only when communication must cross node boundaries
- Replace many small intra-node MPI messages with simple memory accesses and thread barriers
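The last point can be sketched with a small shared-memory stencil (sizes and values are illustrative): threads read their neighbours' values directly from a shared array, and the implicit barriers at the end of the parallel loops take the place of what would otherwise be intra-node halo messages between MPI processes.

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    static double u[N], u_new[N];
    for (int i = 0; i < N; i++)
        u[i] = (double)i;

    for (int step = 0; step < 10; step++) {
        /* Neighbour values are plain memory reads, not messages. */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < N - 1; i++)
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

        /* The implicit barrier above plus this copy phase replace
         * intra-node message exchange and its bookkeeping. */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < N - 1; i++)
            u[i] = u_new[i];
    }

    printf("u[N/2] = %g\n", u[N / 2]);
    return 0;
}
```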
Exploiting accelerators together with CPUs
Systems increasingly combine:
- Multi-core CPUs
- Multiple GPUs or other accelerators per node
A common pattern:
- MPI between nodes, and sometimes between GPUs (e.g., one rank per GPU)
- OpenMP or other threading on the CPU side
- CUDA/OpenACC/OpenMP offload for GPU kernels
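A minimal sketch of this pattern, assuming an MPI library plus a compiler with OpenMP target-offload support (without a GPU the target region simply runs on the host):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    static double x[1 << 20];
    for (int i = 0; i < n; i++)
        x[i] = 1.0;

    double local = 0.0;
    /* Offloaded kernel: runs on the GPU if a device is available,
     * otherwise on the host. */
    #pragma omp target teams distribute parallel for \
            reduction(+:local) map(to: x[0:n]) map(tofrom: local)
    for (int i = 0; i < n; i++)
        local += x[i];

    /* Cross-node combination of the per-rank (per-GPU) results. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global);

    MPI_Finalize();
    return 0;
}
```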
Motivation:
- Utilize all resources:
- CPU cores (thread-level parallelism)
- GPUs (massive data parallelism)
- Multiple nodes (MPI-scale parallelism)
- Coordinate CPU–GPU work at node level while MPI exchanges data between nodes
Practical Motivations in Real Applications
Handling very large problems
For very large simulations or datasets:
- Memory per core is often the limiting factor
- I/O and communication overhead can dominate compute time
Hybrid helps by:
- Lowering per-rank memory needs as discussed earlier
- Organizing I/O at rank level (e.g., one rank per node doing I/O on behalf of its threads; see the sketch after this list)
- Allowing different levels of parallelism for:
- Computation
- I/O
- Pre/post-processing
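The I/O point can be sketched with MPI_Comm_split_type, which groups the ranks that share a node so that only one rank per node writes; the file name and data below are placeholders.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, node_rank;
    MPI_Comm node_comm;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All ranks on the same node end up in the same node_comm. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    double result = 42.0;                       /* placeholder data */
    double node_result = 0.0;
    MPI_Reduce(&result, &node_result, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    if (node_rank == 0) {
        /* One writer per node instead of one per core. */
        char name[64];
        snprintf(name, sizeof name, "out_rank%d.dat", rank);
        FILE *f = fopen(name, "w");
        if (f) { fprintf(f, "%g\n", node_result); fclose(f); }
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```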
Simplifying some aspects of code design
A well-chosen hybrid decomposition can:
- Make communication patterns clearer at the MPI level (fewer, larger subdomains)
- Concentrate parallel loop handling inside node-level kernels
- Allow reuse of optimized shared-memory kernels (e.g., OpenMP-parallel library calls) inside an MPI code
Motivation:
- Separate concerns:
- MPI layer handles global distribution and communication
- Thread layer focuses on local performance and loops
- Enable incremental porting:
- Start from an MPI code and add OpenMP in hot kernels
- Or start from a threaded code and wrap it with MPI to go multi-node
Adapting to different systems and batch environments
Clusters differ:
- Some have many cores per node, others fewer
- Some have multiple GPUs per node, others none
- Memory per core varies
Hybrid designs can be more performance-portable:
- You can tune:
- Number of MPI ranks per node
- Number of threads per rank
- Same code can run efficiently on:
- A small-node-count system with many cores per node
- A large-node-count system with fewer cores per node
Job schedulers also often encourage:
- Efficient use of nodes (e.g., using all cores per node)
- Reduced network traffic
Hybrid programming helps you honor these constraints while maintaining performance.
Trade-offs and When Hybrid Makes Sense
Hybrid programming is not a free win; it adds complexity:
- Two (or more) programming models to understand and debug
- Potential for new kinds of bugs (e.g., mismatched MPI and threading assumptions)
- Additional tuning parameters (ranks vs. threads vs. GPU work size)
Despite that, the motivation to use hybrid is strong when:
- Nodes have many cores and/or GPUs
- Memory per process is a concern
- Pure MPI scaling starts to saturate at high process counts
- You want a single code to run efficiently across a range of machine sizes
Conversely, hybrid may be less compelling when:
- Running on small systems with few cores per node
- Problem size is modest
- Simplicity is more important than peak performance
Understanding these motivations will help you decide, in later chapters, how to combine MPI, OpenMP, and accelerators in a way that matches your applications and target systems.