Hybrid Parallel Programming

Motivation: Why Hybrid Parallel Programming?

Hybrid parallel programming combines multiple parallel models—most commonly MPI for distributed-memory parallelism and OpenMP or GPU programming models for shared-memory or accelerator parallelism—within a single application.

In modern HPC systems, each node typically has:

Using only a single parallel model (e.g., MPI everywhere) can leave parts of this hardware underutilized or create bottlenecks. Hybrid programming aims to:

Hybrid parallel programming is especially common in:

Common Hybrid Combinations

The most common hybrid combinations in practice include:

Design Considerations for Hybrid Models

Designing a hybrid application requires architectural decisions that do not appear in single-model codes. Some key aspects:

Choosing MPI Process and Thread Counts

Typical strategies on a node with $N_{\text{cores}}$:

Selecting the right balance is a performance-tuning decision and can vary per machine and application.
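
For experimenting with this balance, it helps if every run reports its own layout. The following minimal sketch (assuming an MPI + OpenMP build, with the thread count taken from OMP_NUM_THREADS or the batch system) prints the number of ranks and the number of threads per rank:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // Threads per rank, as set by OMP_NUM_THREADS or the batch system
    int threads = omp_get_max_threads();
    // Rank 0 reports the overall layout so different configurations
    // can be compared easily in the job output
    if (rank == 0) {
        printf("%d MPI ranks x %d OpenMP threads per rank\n", size, threads);
    }
    MPI_Finalize();
    return 0;
}

On a 32-core node, for example, this could be launched with 4 ranks and OMP_NUM_THREADS=8, or with 32 ranks and 1 thread each, and the resulting timings compared.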

Work Decomposition Across Levels

In a hybrid code, you must decide what kind of workload each level of parallelism handles:

Multi-level decomposition should minimize:

Programming Models for Hybrid CPU-Only Codes

Although many variations are possible, the most established hybrid CPU-only pattern is MPI + OpenMP.

Basic MPI + OpenMP Structure

At a high level, an MPI + OpenMP application looks like:

#include <mpi.h>
#include <omp.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // Initialize local data based on rank
    // ...
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        // Thread-local work here
    }
    // MPI communication typically happens between parallel regions;
    // calling MPI from inside a parallel region requires an appropriate
    // MPI threading level (see "MPI Threading Levels" below)
    MPI_Finalize();
    return 0;
}

The hybrid aspects to pay special attention to include:

Details of MPI and OpenMP themselves belong to other chapters; here, the focus is how they interact and are combined.

MPI Threading Levels

When combining MPI and threads, you must consider whether threads will call MPI routines. MPI defines four levels of thread support, requested at startup with MPI_Init_thread:

  • MPI_THREAD_SINGLE: only one thread exists in the process.
  • MPI_THREAD_FUNNELED: the process may be multithreaded, but only the main thread makes MPI calls.
  • MPI_THREAD_SERIALIZED: any thread may make MPI calls, but only one at a time.
  • MPI_THREAD_MULTIPLE: multiple threads may call MPI concurrently.

Typical hybrid designs try to use MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED to avoid the extra overhead of MPI_THREAD_MULTIPLE, unless there is a clear need for fully concurrent MPI calls from multiple threads.
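
Below is a minimal sketch of requesting and checking a threading level at startup. The check matters because an MPI library may provide a lower level than the one requested:

#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
    int provided;
    // Request FUNNELED: only the main thread will make MPI calls
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    // The library may provide less than requested, so verify it
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI does not support MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    // ... hybrid MPI + OpenMP work ...
    MPI_Finalize();
    return 0;
}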

OpenMP Usage Patterns with MPI

Some common patterns of mixing MPI and OpenMP include:
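
One widely used pattern, consistent with MPI_THREAD_FUNNELED, keeps the computation inside a parallel region and funnels all communication through the master thread. A hedged sketch of a single step (the function and variable names are illustrative, and MPI_Allreduce stands in for whatever communication the application actually needs):

#include <mpi.h>
#include <omp.h>
// Funneled pattern: all threads compute, only the master thread
// communicates (requires at least MPI_THREAD_FUNNELED)
void timestep(double* local, int n, MPI_Comm comm) {
    double local_sum = 0.0, global_sum = 0.0;
    #pragma omp parallel
    {
        // All threads work on the rank-local data
        #pragma omp for reduction(+:local_sum)
        for (int i = 0; i < n; i++) {
            local[i] *= 0.5;        // placeholder computation
            local_sum += local[i];
        }
        // Only the master thread calls MPI ...
        #pragma omp master
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
        // ... and all threads wait until the result is available
        #pragma omp barrier
        // all threads can now use global_sum
    }
}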

Programming Models for Hybrid CPU–GPU Codes

Hybrid CPU–GPU codes often combine MPI with a GPU programming model:

Rank-to-GPU Mapping

A crucial design decision is how MPI ranks map to GPUs:

The mapping choice has implications for:
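
A common choice is one MPI rank per GPU, with the device selected from the rank's position on its node. A minimal sketch, assuming CUDA as the GPU model, an MPI-3 library (for MPI_Comm_split_type), and at least one visible GPU per node:

#include <mpi.h>
#include <cuda_runtime.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    // Find this rank's position among the ranks sharing the node
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);
    // Map node-local ranks to the node's GPUs round-robin
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);
    // ... per-rank GPU work ...
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}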

GPU-Aware MPI

On GPU clusters, many MPI implementations support:

Hybrid designs can:

Exact APIs and performance details depend on the specific hardware and MPI implementation, but the hybrid concept is:
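
With a GPU-aware (for example, CUDA-aware) MPI build, buffers that live in device memory can be passed directly to MPI calls instead of being staged through host memory first. A hedged sketch, assuming such a build, CUDA, and at least two ranks:

#include <mpi.h>
#include <cuda_runtime.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Allocate the exchange buffer directly in device memory
    const int n = 1 << 20;
    double* d_buf;
    cudaMalloc((void**)&d_buf, n * sizeof(double));
    // With a GPU-aware MPI build the device pointer can be passed
    // straight to MPI; otherwise the data must first be copied to a
    // host buffer with cudaMemcpy (buffer contents omitted here)
    if (rank == 0) {
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}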

Node-Level Parallelism in Hybrid Codes

Node-level parallelism concerns how you exploit all the resources on a single compute node using threads and/or accelerators.

CPU Node-Level Parallelism

On CPU-only nodes, the primary concerns are:

Hybrid codes often:
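
Thread placement is typically controlled with environment variables such as OMP_PLACES and OMP_PROC_BIND; one way to verify the resulting binding is to let each thread report the core it runs on. A Linux-specific sketch (sched_getcpu is a glibc extension):

#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>
int main(void) {
    // Each thread reports the core it currently runs on, which helps
    // verify that OMP_PLACES / OMP_PROC_BIND behave as intended
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("thread %d running on core %d\n", tid, sched_getcpu());
    }
    return 0;
}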

GPU Node-Level Parallelism

On GPU nodes, node-level parallelism includes:

For beginners, a typical stepwise path is:

  1. Start with an MPI-only code.
  2. Introduce GPU offload per MPI rank (see the sketch after this list).
  3. Optionally introduce CPU threading on top if beneficial.
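
As an illustration of step 2, each MPI rank can offload its local loop with OpenMP target directives (CUDA, HIP, or other models work equally well). A minimal sketch, assuming a compiler with OpenMP offload support:

#include <mpi.h>
#include <omp.h>
#include <stdlib.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Each rank owns its local piece of the global data
    const int n = 1000000;
    double* local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) local[i] = (double)rank;
    // Offload the rank-local loop to the default device
    #pragma omp target teams distribute parallel for map(tofrom: local[0:n])
    for (int i = 0; i < n; i++) {
        local[i] = 2.0 * local[i] + 1.0;
    }
    free(local);
    MPI_Finalize();
    return 0;
}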

Cluster-Level Parallelism in Hybrid Codes

Cluster-level parallelism is primarily managed via distributed-memory mechanisms (e.g., MPI). In a hybrid code, this level should:

Hybrid designs let you craft different strategies for:

Common Hybrid Programming Patterns

Several recurring design patterns appear across many hybrid applications. Recognizing them helps both in reading existing codes and designing new ones.

Pattern 1: MPI Domains + OpenMP Loop Parallelism

Idea: Each MPI rank owns a large subdomain; within that subdomain, OpenMP parallelizes inner loops.

Characteristics:
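
A hedged sketch of the node-local part of this pattern (the array names and the 5-point stencil are illustrative; the MPI halo exchange around this call is not shown):

#include <omp.h>
// Pattern 1: each rank owns the subdomain arrays u/u_new (halo cells
// already exchanged via MPI); OpenMP parallelizes the loop nest
void update_subdomain(const double* u, double* u_new, int nx, int ny) {
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < nx - 1; i++) {
        for (int j = 1; j < ny - 1; j++) {
            u_new[i * ny + j] = 0.25 * (u[(i - 1) * ny + j] +
                                        u[(i + 1) * ny + j] +
                                        u[i * ny + j - 1] +
                                        u[i * ny + j + 1]);
        }
    }
}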

Pattern 2: MPI with Thread-Parallel Tasks

Idea: MPI for domain decomposition; OpenMP or another threading library to handle irregular or task-based parallelism within each rank.

Characteristics:
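
A hedged sketch of the node-local part, using OpenMP tasks for items of varying cost (the data layout and names are illustrative):

#include <omp.h>
// Pattern 2: the rank-local work list is irregular, so it is processed
// with OpenMP tasks rather than a static loop schedule
void process_local_items(double** items, const int* sizes, int n_items) {
    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < n_items; k++) {
            // One task per item; item sizes may differ widely
            #pragma omp task firstprivate(k)
            {
                for (int i = 0; i < sizes[k]; i++) {
                    items[k][i] *= 2.0;   // placeholder computation
                }
            }
        }
    }   // implicit barrier: all tasks finish before returning
}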

Pattern 3: MPI Rank per GPU + GPU Kernels

Idea: One MPI rank per GPU, each managing a subset of the global data, offloaded to the device.

Characteristics:

Pattern 4: Master–Worker Within a Node

Idea: A node-level “master” thread performs MPI communication and management, while worker threads handle computation.

Characteristics:

Pattern 5: Hierarchical Decomposition

Idea: Parallelism is split over multiple hierarchical levels:

Characteristics:

Challenges and Pitfalls in Hybrid Programming

Hybrid approaches bring additional complexity beyond single-model parallelism. Some typical issues:

Increased Complexity and Maintenance Cost

For many applications, a simple model (e.g., pure MPI or MPI + GPU) might be sufficient, and hybrid complexity must be justified by real performance gains.

Load Imbalance Across Levels

Hybrid codes require performance analysis at multiple levels:

NUMA and Memory Locality Issues

Hybrid CPU codes can suffer from:

Symptoms include:
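
A standard mitigation is "first touch": initialize data in parallel with the same thread layout that later computes on it (combined with thread pinning, e.g. OMP_PROC_BIND=close), so memory pages end up on the NUMA domain of the threads that use them. A minimal sketch:

#include <omp.h>
#include <stdlib.h>
int main(void) {
    const size_t n = 100000000;
    double* a = malloc(n * sizeof(double));
    // First touch: a page is placed on the NUMA domain of the thread
    // that first writes it, so initialize with the same static
    // schedule that the compute loops will use
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
    }
    // Compute loops with the same schedule then access mostly
    // NUMA-local memory
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] = 2.0 * a[i] + 1.0;
    }
    free(a);
    return 0;
}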

Thread-Safety and Race Conditions with MPI

When multiple threads may interact with MPI:

Practical advice:
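
If threads other than the main thread must communicate, one conservative option is MPI_THREAD_SERIALIZED combined with explicit mutual exclusion around every MPI call. A hedged sketch (the routine and buffer names are illustrative, and partial is assumed to hold one value per thread):

#include <mpi.h>
#include <omp.h>
// Any thread may send its partial result, but only one thread at a
// time may be inside MPI, so calls are wrapped in a critical section
// (sufficient under MPI_THREAD_SERIALIZED)
void send_partial_results(const double* partial, int dest, MPI_Comm comm) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        // ... thread-local computation producing partial[tid] ...
        #pragma omp critical(mpi_calls)
        {
            // Distinct tags keep messages from different threads apart
            MPI_Send(&partial[tid], 1, MPI_DOUBLE, dest, tid, comm);
        }
    }
}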

Debugging and Profiling Complexity

A structured approach to performance analysis—profiling at each level separately, then combined—is essential.

When (and When Not) to Use Hybrid Programming

Hybrid parallel programming is not always the right choice. Situations where it makes sense include:

Hybrid programming may not be necessary if:

Practical Getting-Started Strategy

For absolute beginners, a practical roadmap to hybrid codes might be:

  1. Start with a correct, reasonably efficient serial code.
  2. Add MPI for domain decomposition across multiple nodes.
  3. Introduce OpenMP inside each MPI rank to parallelize the most expensive loops or tasks.
  4. Measure performance at each step to confirm actual benefits.
  5. (Optional) Add GPU support:
    • Offload critical kernels while keeping MPI+OpenMP structure on the host.
    • Gradually move more computation to the GPU.
  6. Iterate on placement and scaling:
    • Experiment with different rank/thread configurations, affinities, and resource requests in job scripts.
    • Evaluate strong and weak scaling behavior.

Focusing on a small set of clear hybrid patterns and gradually refining them is typically more productive than trying to apply every available technique at once.
