Overview
Distributed-memory parallel programming is the dominant model for scaling applications across many nodes in an HPC cluster. Instead of sharing a single address space, each process has its own private memory, and explicit communication is required to exchange data. This model underlies most large-scale simulations and data-processing codes that run on modern supercomputers.
This chapter focuses on the core ideas, design patterns, and practical implications of distributed-memory programming, independent of any specific library. Concrete APIs (such as MPI) are covered separately.
Single Process vs Many Processes
In distributed-memory programming, work is performed by multiple independent processes:
- Each process:
- Has its own address space (its own heap, stack, and global variables)
- Has its own copy of the program code
- May run on a different node from the others
- Processes cannot directly read or write each other’s memory.
- All interaction happens via explicit communication (typically message passing).
Conceptually:
- Shared-memory: “many threads, one memory”
- Distributed-memory: “many processes, each with its own memory”
Because each process has its own memory, the programmer must decide:
- How data is divided between processes
- When and how processes exchange information
- How to keep logically-related data consistent across processes
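To make this concrete, the short sketch below runs the same program as several independent processes. mpi4py is used here only as a stand-in for any message-passing runtime (a specific library is introduced later), and the variable names are illustrative.

```python
# Minimal sketch, assuming mpi4py is available; launch with e.g. `mpirun -n 4 python example.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's integer ID
size = comm.Get_size()   # total number of processes

# `counter` is private: every process has its own copy, and changing it
# here is invisible to all other processes.
counter = rank * 100
print(f"process {rank} of {size}: my private counter = {counter}")
```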
Message Passing as a Programming Model
The core abstraction in distributed-memory programming is message passing:
- A process sends a message containing data to another process
- Another process receives that message
- Messages can be:
- Point-to-point (one sender, one receiver)
- Collective (involving many processes)
Typical operations (in abstract form):
- send(destination, data)
- recv(source, buffer)
- broadcast(root, data)
- reduce(operation, data, root)
Distributed-memory libraries (such as MPI) provide robust, portable implementations of these concepts, but the underlying idea remains “cooperate by exchanging messages, not by sharing memory.”
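As a hedged illustration, here is how the abstract operations above might look in one concrete realization (mpi4py); the function names and arguments are that library's, not part of the abstract model.

```python
# Sketch only: one possible mapping of the abstract operations onto a real library (mpi4py).
# Run with at least 2 processes, e.g. `mpirun -n 4 python messages.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Point-to-point: rank 0 sends, rank 1 receives.
if rank == 0:
    comm.send({"step": 1, "value": 3.14}, dest=1, tag=0)    # send(destination, data)
elif rank == 1:
    msg = comm.recv(source=0, tag=0)                        # recv(source, buffer)
    print("rank 1 received", msg)

# Collectives: every process takes part in the same call.
params = comm.bcast({"dt": 0.01} if rank == 0 else None, root=0)   # broadcast(root, data)
total  = comm.reduce(rank, op=MPI.SUM, root=0)                     # reduce(operation, data, root)
if rank == 0:
    print("sum of all ranks:", total)
```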
Data Decomposition and Distribution
A central design decision in distributed-memory programs is how to partition data across processes. This is often called decomposition or domain decomposition.
Common patterns:
- 1D decomposition
- Example: Split a large array of length $N$ into $P$ contiguous chunks, one per process.
- 2D / 3D decomposition
- Example: Split a matrix or 3D grid into blocks or tiles; each process owns a sub-block.
- Block vs cyclic vs block-cyclic
- Block: each process gets one contiguous block.
- Cyclic: elements are assigned in a round-robin fashion.
- Block-cyclic: chunks of contiguous elements are distributed in a cyclic pattern.
Considerations when choosing a decomposition:
- Balance: each process should get roughly equal work.
- Locality: processes should mostly work on their local data.
- Communication pattern: minimize the amount and distance of communication needed.
The “shape” of the decomposition often mirrors the problem’s structure (e.g., decomposing a 3D physical domain into 3D subdomains).
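The index arithmetic behind these layouts is short enough to show directly. The sketch below (plain Python; the helper names are illustrative, not from any library) maps a global index to the process that owns it under block, cyclic, and block-cyclic distributions.

```python
# Sketch of ownership maps for an array of length N distributed over P processes.

def block_owner(i, N, P):
    """Block distribution: process p owns one contiguous chunk of ~N/P elements."""
    chunk = (N + P - 1) // P          # ceiling division
    return i // chunk

def cyclic_owner(i, P):
    """Cyclic distribution: elements are dealt out round-robin."""
    return i % P

def block_cyclic_owner(i, P, b):
    """Block-cyclic distribution: blocks of b contiguous elements are dealt out round-robin."""
    return (i // b) % P

N, P = 10, 3
print([block_owner(i, N, P) for i in range(N)])        # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
print([cyclic_owner(i, P) for i in range(N)])          # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
print([block_cyclic_owner(i, P, 2) for i in range(N)]) # [0, 0, 1, 1, 2, 2, 0, 0, 1, 1]
```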
Communication Patterns
Distributed-memory applications are often defined by their communication patterns. Some common ones:
Nearest-Neighbor (Stencil) Communication
Processes own subregions of a larger grid. To update boundary cells, they exchange data with neighboring subdomains.
- Each process sends boundary (“halo” or “ghost”) data to neighbors.
- Each process receives boundary data from neighbors.
- Used in:
- Finite difference / finite volume PDE solvers
- Lattice or grid-based simulations
Key properties:
- Regular, predictable communication pattern
- Communication volume typically scales with the surface area of each subdomain
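A hedged sketch of a one-dimensional halo exchange follows; mpi4py again stands in for any message-passing library, and the local array size is arbitrary.

```python
# Sketch of a 1D halo ("ghost cell") exchange; mpi4py serves only as an example library.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_local = 8                              # interior cells owned by this process
u = [float(rank)] * (n_local + 2)        # u[0] and u[-1] are ghost cells

left  = rank - 1 if rank > 0 else None
right = rank + 1 if rank < size - 1 else None

# A combined send/receive with each neighbor keeps the exchange deadlock-free.
if left is not None:
    u[0] = comm.sendrecv(u[1], dest=left, source=left)            # my first interior cell goes left
if right is not None:
    u[-1] = comm.sendrecv(u[n_local], dest=right, source=right)   # my last interior cell goes right
```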
Global Collectives
Sometimes all processes must cooperate to combine or redistribute data:
- Broadcast: one process shares data with all others.
- Scatter: one process distributes different portions of data to all processes.
- Gather: one process collects data from all processes.
- Allgather / alltoall: all processes both send and receive.
- Reductions: combine values from all processes using an operation like sum, max, min.
Global collectives are powerful but can be expensive at very large process counts, so their use and frequency should be considered carefully.
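The sketch below shows these collectives in one concrete notation (mpi4py); the data values are illustrative.

```python
# Sketch: common collective operations (mpi4py used as one concrete example).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

config = comm.bcast({"steps": 100} if rank == 0 else None, root=0)  # broadcast
chunks = [list(range(p, p + 3)) for p in range(size)] if rank == 0 else None
mine   = comm.scatter(chunks, root=0)                               # scatter: one chunk per process
local_sum = sum(mine)
total     = comm.reduce(local_sum, op=MPI.SUM, root=0)              # reduction to rank 0
all_sums  = comm.allgather(local_sum)                               # every process gets every sum
if rank == 0:
    print("global sum:", total, "per-process sums:", all_sums)
```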
Irregular and Sparse Communication
For irregular data structures (graphs, sparse matrices, particle simulations), communication patterns may not be uniform:
- Some processes may need to communicate with many others, some with few.
- Communication partners may change over time.
- Data volume may be highly unbalanced.
Techniques:
- Build communication schedules or patterns once, then reuse.
- Use indirection arrays or adjacency lists to describe who sends what to whom.
- Consider reordering data or graph partitioning to reduce communication.
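The "build once, reuse" idea can be sketched without any communication library at all: from an ownership map and the list of remote indices a process needs, derive whom it must receive from. The helper names below are illustrative.

```python
# Sketch: building a reusable communication schedule from an irregular access pattern.

def build_schedule(my_rank, needed_indices, owner_of):
    """Group the remote indices this process needs by the rank that owns them.

    The resulting dict {rank: [indices]} is the receive schedule; the matching
    send schedule is obtained by exchanging these lists once and then both are
    reused on every iteration.
    """
    recv_from = {}
    for idx in needed_indices:
        owner = owner_of(idx)
        if owner != my_rank:                      # local data needs no communication
            recv_from.setdefault(owner, []).append(idx)
    return recv_from

# Example: 100 elements distributed block-wise over 4 processes (25 each).
owner_of = lambda idx: idx // 25
print(build_schedule(1, [3, 27, 28, 52, 99], owner_of))
# -> {0: [3], 2: [52], 3: [99]}   (indices 27 and 28 are already local to rank 1)
```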
Synchronization and Coordination
Because processes run concurrently and independently, programmers must coordinate them to ensure correctness.
Common coordination concepts in distributed-memory:
- Point-to-point synchronization
- A send/recv pair can implicitly synchronize two processes at that point.
- Collective synchronization
- Barriers: all processes wait until everyone reaches the same point.
- Some collective operations also act as synchronization points.
Issues to be aware of:
- Over-synchronization can harm performance (processes spend time waiting).
- Under-synchronization can cause:
- Deadlocks (processes waiting forever on each other)
- Use of uninitialized or incomplete data
The design goal is to synchronize only when necessary, and in a way that avoids circular dependencies.
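As a hedged example of avoiding a circular dependency, the ring exchange below orders blocking sends and receives by rank parity (mpi4py as a stand-in); if every process posted its receive first, or every send blocked waiting for a matching receive, the ring would deadlock.

```python
# Sketch: a deadlock-free ring exchange using parity-ordered blocking calls.
# Run with two or more processes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

right = (rank + 1) % size
left  = (rank - 1) % size

# Alternating the send/recv order by rank parity breaks the circular dependency
# that would arise if every process waited on its neighbor in the same order.
if rank % 2 == 0:
    comm.send(rank, dest=right, tag=0)
    incoming = comm.recv(source=left, tag=0)
else:
    incoming = comm.recv(source=left, tag=0)
    comm.send(rank, dest=right, tag=0)

print(f"rank {rank} received {incoming} from rank {left}")
```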
Latency, Bandwidth, and Granularity
Distributed-memory performance is highly affected by the properties of the interconnect between nodes. Two key metrics:
- Latency: the time to initiate a message transfer, paid even for very small messages.
- Bandwidth: the rate at which data moves once the transfer is underway.
These lead to important programming principles:
- Avoid many tiny messages:
- Latency dominates; overhead per message becomes large.
- Instead, aggregate data into fewer, larger messages when possible.
- Overlap communication with computation:
- While some data is being transferred, do useful work that does not depend on it.
- This reduces the visible cost of communication.
Granularity:
- Coarse-grained parallelism: large chunks of work per communication.
- Fine-grained parallelism: small chunks of work, frequent communication.
Distributed-memory programs usually favor coarse-grained or moderately coarse-grained tasks to amortize communication overhead.
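A back-of-the-envelope cost model makes the aggregation and granularity points concrete. Assuming the common linear model T(n) = latency + n / bandwidth, with illustrative (not measured) numbers:

```python
# Sketch: linear cost model T(n) = latency + n / bandwidth (illustrative numbers only).
latency   = 1e-6        # 1 microsecond per message (assumed)
bandwidth = 10e9        # 10 GB/s (assumed)

def transfer_time(n_bytes, n_messages):
    """Total time to move n_bytes split evenly across n_messages messages."""
    return n_messages * (latency + (n_bytes / n_messages) / bandwidth)

total_bytes = 1_000_000  # 1 MB of data to exchange
print(f"1 message of 1 MB   : {transfer_time(total_bytes, 1) * 1e6:8.1f} us")
print(f"1000 messages of 1 kB: {transfer_time(total_bytes, 1000) * 1e6:8.1f} us")
# With these numbers the aggregated transfer takes ~101 us, while the fine-grained
# version pays the 1 us latency 1000 times (~1100 us total).
```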
Scalability in Distributed-Memory Programs
Distributed-memory programming is the main route to scaling codes to thousands or millions of cores. However, several factors limit scalability:
- Communication overhead:
- More processes often mean more communication.
- Global synchronization:
- Barriers and global collectives become more expensive.
- Load imbalance:
- Some processes may finish their work earlier and wait for others.
- Algorithmic limitations:
- Some algorithms inherently require global information or serial work.
Scalability analysis typically involves:
- Understanding the algorithm’s communication volume
- Identifying global operations and reducing their frequency
- Restructuring data decompositions to improve balance and locality
These concerns tie into strong and weak scaling concepts and parallel efficiency covered elsewhere, but the distributed-memory model is where these issues become most visible.
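One hedged way to quantify the communication overhead of a decomposition is a surface-to-volume estimate: in a cubic 3D block decomposition, each process computes on its subdomain's volume but communicates its surface, so the ratio worsens as subdomains shrink. The numbers below are illustrative.

```python
# Sketch: surface-to-volume ratio for a cubic domain of N^3 cells on P processes
# (assumes P is a perfect cube and a block decomposition into cubic subdomains).
def comm_to_compute_ratio(N, P):
    p_per_dim = round(P ** (1 / 3))
    n_local = N // p_per_dim              # local subdomain edge length
    surface = 6 * n_local ** 2            # halo cells exchanged (one layer per face)
    volume  = n_local ** 3                # cells updated locally
    return surface / volume               # equals 6 / n_local

for P in (8, 64, 512, 4096):
    print(f"P={P:5d}  surface/volume = {comm_to_compute_ratio(1024, P):.3f}")
# As P grows (strong scaling), each subdomain shrinks and communication takes a
# larger share of the work, which is one reason parallel efficiency drops.
```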
Fault Tolerance and Resilience Considerations
Large distributed-memory systems have a higher chance that some component will fail during long runs. While many current applications still assume a fail-stop model (the job fails and is restarted from scratch), several techniques are relevant:
- Checkpoint/restart:
- Periodically save state to disk.
- On failure, restart from the last checkpoint instead of the beginning.
- Algorithm-based fault tolerance (ABFT):
- Embed redundancy or checksums into the algorithm itself.
- Process recovery / spare processes (system dependent):
- Replace failed processes and restore their data from redundancy.
Designing distributed-memory codes with clear, reconstructible global state can make it easier to adopt such strategies.
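A minimal checkpoint/restart sketch in plain Python is shown below; the file names, checkpoint interval, and "computation" are all illustrative, and real applications would typically use parallel I/O rather than one pickle file per process.

```python
# Sketch: per-process checkpoint/restart (file names and interval are illustrative).
import os
import pickle

def save_checkpoint(step, state, rank):
    with open(f"checkpoint_rank{rank}.pkl", "wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def load_checkpoint(rank):
    path = f"checkpoint_rank{rank}.pkl"
    if os.path.exists(path):
        with open(path, "rb") as f:
            data = pickle.load(f)
        return data["step"], data["state"]
    return 0, [0.0] * 1000                          # no checkpoint: start from scratch

rank = 0                                            # would come from the parallel runtime
step, state = load_checkpoint(rank)                 # resume from the newest checkpoint, if any
for step in range(step, 100):
    state = [x + 1.0 for x in state]                # stand-in for real computation
    if step % 10 == 0:
        save_checkpoint(step, state, rank)          # checkpoint every 10 steps
```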
Design Patterns in Distributed-Memory Programs
Several recurring high-level structures appear in distributed-memory codes:
- SPMD (Single Program, Multiple Data)
- All processes run the same program.
- Behavior diverges based on process rank (an integer ID).
- Most MPI-based codes follow this model.
- Master–worker (or manager–worker)
- One (or a few) processes coordinate work.
- Many worker processes perform computations.
- Useful for dynamic load balancing and task farming.
- Domain-decomposition solvers
- Each process holds a subdomain of a physical or logical domain.
- Iterative updates, with halo exchanges between neighbors.
- Distributed data-parallel operations
- Processes jointly operate on large arrays or matrices.
- Each process handles a subset; collective operations handle global results.
Choosing an appropriate pattern simplifies reasoning about both correctness and performance.
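As a hedged sketch of the master-worker pattern, the code below has rank 0 hand out tasks on demand (mpi4py as a stand-in; the tags and task list are illustrative), which naturally balances unevenly sized tasks.

```python
# Sketch: master-worker task farming (mpi4py as an example library; run with 2+ processes).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
TASK, STOP = 1, 2                       # illustrative message tags

if rank == 0:                           # master: hand out tasks as workers become free
    tasks = list(range(20))
    results = []
    active = size - 1
    while active > 0:
        status = MPI.Status()
        result = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        if result is not None:          # the first message from each worker is just "ready"
            results.append(result)
        if tasks:
            comm.send(tasks.pop(), dest=status.Get_source(), tag=TASK)
        else:
            comm.send(None, dest=status.Get_source(), tag=STOP)
            active -= 1
    print("results:", sorted(results))
else:                                   # worker: request work until told to stop
    comm.send(None, dest=0, tag=TASK)   # initial "ready" message
    while True:
        status = MPI.Status()
        task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == STOP:
            break
        comm.send(task * task, dest=0, tag=TASK)   # stand-in for real work
```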
Pros and Cons of Distributed-Memory Programming
Understanding the trade-offs helps you decide when and how to use distributed memory, and when a hybrid or alternative approach may be preferable.
Advantages:
- Scalability across many nodes:
- The only practical way to exploit the full scale of a large cluster.
- Explicit control over communication:
- Can optimize data movement to match problem and hardware.
- Natural fit for problems with inherently distributed data:
- Large spatial domains, huge datasets, loosely coupled tasks.
Disadvantages:
- Higher programming complexity:
- Must design decompositions and communication explicitly.
- Debugging can be challenging:
- Bugs may depend on timing and data distribution.
- Performance is sensitive to small design choices:
- Communication patterns, message sizes, and synchronization points matter greatly.
These aspects motivate both higher-level abstractions and hybrid models that combine distributed memory with other paradigms.
How Distributed-Memory Fits Into the Broader HPC Landscape
Distributed-memory parallel programming is one layer in a multi-level parallelism hierarchy:
- Node-level parallelism:
- Shared-memory (threads, OpenMP) and vectorization.
- Node-to-node parallelism:
- Distributed-memory (processes, message passing).
- Accelerator-level parallelism:
- GPU or other accelerators attached to nodes.
Most modern high-performance applications use distributed-memory parallelism as the backbone for scaling across nodes, and then integrate other forms of parallelism within each node. Subsequent chapters focus on a specific, widely used distributed-memory library and on combining distributed memory with other programming models.