Distributed-Memory Parallel Programming

Context and Motivation

Distributed memory parallel programming is the dominant model for large scale high performance computing. In this model, a computation is executed by multiple processes, each with its own private memory space. The processes collaborate by explicitly exchanging data over a network. No process can directly read or write another process’s memory. All interaction happens through communication operations provided by communication libraries, most commonly MPI, which will be treated in its own chapter.

In contrast to shared memory programming, where threads see a single common address space, distributed memory programming is designed to scale across many nodes of a cluster. It maps naturally onto the hardware of HPC clusters, where each node has its own memory and nodes are connected by an interconnect. This separation of memory has important consequences for how you design data structures, algorithms, and communication patterns.

The Distributed Memory Model

In a distributed memory system, each process has its own address space and typically runs on a different core, often on a different node. If you have $P$ processes, you effectively have $P$ separate memories. A piece of data exists in a given memory only if it has been explicitly allocated and, if necessary, received there.

You can think of this as a set of independent programs that occasionally send messages to each other. Conceptually, each process is executing the same program, but often on different subsets of the data. This style is usually called SPMD, short for Single Program, Multiple Data. Every process runs the same executable, but its behavior can depend on a process identifier, for example a rank in MPI, so that different processes take different roles or work on different portions of the problem.
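
As a first concrete sketch, in C with MPI (treated in detail in its own chapter), the program below is launched as many identical processes, each of which branches on its rank:

```c
/* Minimal SPMD sketch: every process runs this same executable,
 * but uses its rank to take on a different role. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's identifier */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    if (rank == 0)
        printf("Process 0 of %d takes a coordinating role\n", size);
    else
        printf("Process %d works on its own data subset\n", rank);

    MPI_Finalize();
    return 0;
}
```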

A central mental shift in distributed memory programming is to stop imagining “a big array that everyone shares” and instead think of “many local pieces of data that together form a global object.” Any operation that involves data owned by multiple processes must be decomposed into local computations plus explicit communication.

Data Decomposition and Ownership

A core design decision in distributed memory programming is the decomposition of your data across processes. This decomposition determines which process “owns” which portion of the data and directly influences how often and how much data needs to travel over the network.

For regular grid based problems such as finite difference or finite volume methods, it is common to divide arrays into subdomains and assign each subdomain to one process. For example, a two dimensional array can be split into horizontal stripes, vertical stripes, or blocks. Each process then stores only the part of the global array that it owns, and computations that depend only on local data proceed without communication.
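
As an illustrative sketch of the block case, the helper below (block_2d is our own name, not a library routine) lets each process derive the size of its local block from its rank alone, assuming the processes are arranged in a logical px by py grid:

```c
/* Illustrative 2D block decomposition: an nx x ny global grid is split
 * over a px x py logical process grid, where px * py equals the number
 * of processes. Remainder rows and columns go to the lowest-numbered
 * block rows and columns so the split stays as even as possible. */
void block_2d(int nx, int ny, int px, int py, int rank,
              int *local_nx, int *local_ny) {
    int pi = rank / py;                           /* block row    */
    int pj = rank % py;                           /* block column */
    *local_nx = nx / px + (pi < nx % px ? 1 : 0);
    *local_ny = ny / py + (pj < ny % py ? 1 : 0);
}
```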

Whenever a computation requires neighbor values that live on other processes, you need to arrange communication. A standard pattern is to use halo or ghost cells. Each process stores a narrow region around the edges of its subdomain that duplicates the interface data from neighboring subdomains. At each iteration, processes exchange boundary data so that halo cells are updated. The computation then uses halo cells like local data, without further communication until the next update.
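
A minimal halo exchange sketch for a horizontal stripe decomposition might look as follows, assuming each process stores its local rows plus one halo row above and below (the MPI calls used here are introduced properly in the MPI chapter):

```c
#include <mpi.h>

/* Illustrative halo exchange for a 1D row decomposition. The local
 * array u holds local_rows interior rows plus one halo row at index 0
 * (copied from the upper neighbor) and one at index local_rows + 1
 * (copied from the lower neighbor); each row has ny entries. */
void exchange_halos(double *u, int local_rows, int ny,
                    int rank, int size, MPI_Comm comm) {
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first interior row up; receive my lower halo from below. */
    MPI_Sendrecv(&u[1 * ny],                ny, MPI_DOUBLE, up,   0,
                 &u[(local_rows + 1) * ny], ny, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send my last interior row down; receive my upper halo from above. */
    MPI_Sendrecv(&u[local_rows * ny],       ny, MPI_DOUBLE, down, 1,
                 &u[0],                     ny, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}
```

Processes at the edge of the domain pass MPI_PROC_NULL as the missing neighbor, which turns the corresponding send and receive into no-ops.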

For irregular problems, such as graphs or unstructured meshes, data decomposition is more complicated. Processes might own subsets of vertices or elements, and communication is required whenever data crosses ownership boundaries. Good decompositions aim to balance computational load and minimize the number and size of messages.

In distributed memory programming, always define a clear ownership rule: every piece of data is owned by exactly one process, and all updates to that data are ultimately performed by the owning process.

The SPMD Execution Style

The SPMD model is the dominant style for distributed memory codes. You typically launch many identical processes. Each process starts at main and runs through the same source code. Differences arise because each process learns its identity at runtime, for instance through a rank number, and then uses that identity to pick its data subset and its role in the algorithm.

A typical distributed memory program has a structure like this. It initializes the communication library. It queries the total number of processes and the current process’s identifier. It partitions the global problem across processes based on these values. Each process performs computations on its local data and occasionally communicates with others to exchange boundary data, combine partial results, or redistribute work. At the end, it finalizes the communication library.
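
In C with MPI, that structure reduces to a skeleton like the following, where the comments stand in for application code:

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                 /* initialize the communication library */

    int rank, size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identifier */

    /* Partition the global problem based on rank and size,
     * then allocate and initialize the local data. */

    /* Repeatedly: compute on local data; communicate to exchange
     * boundary data, combine partial results, or redistribute work. */

    MPI_Finalize();                         /* finalize the communication library */
    return 0;
}
```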

Because all processes execute independently and have separate memories, you must avoid patterns that implicitly assume a global flow of control. There is no shared stack, and there are no global variables visible to all processes. Any information that should be common across processes must either be computed identically on each process or be communicated explicitly.

Communication and Coordination

Communication in distributed memory programming is explicit. A process that needs data from another process must request it or wait for it to arrive through a message. Likewise, to send data to another process you must call a send operation. You cannot simply dereference a pointer to remote memory.

There are two broad styles of communication. One involves explicit messages, for example sending and receiving buffer contents between specific pairs of processes. The other uses collective operations that involve groups of processes, such as broadcasting a value from one process to all others, or reducing values from all processes into a single result. Both styles are important and are discussed in detail in the chapter on MPI.
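
As a brief illustration of the collective style, the fragment below broadcasts a parameter from one process to all others and then reduces per-process partial sums into a single result:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Broadcast: a value known only to rank 0 reaches every process. */
    double dt = 0.0;
    if (rank == 0) dt = 0.001;             /* illustrative parameter */
    MPI_Bcast(&dt, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Reduction: combine one partial result per process into a sum. */
    double local_sum = (double)rank;       /* stand-in for real work  */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("dt = %g, global sum = %g\n", dt, global_sum);

    MPI_Finalize();
    return 0;
}
```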

The cost of communication is much higher than that of local computation. Each message has a startup overhead and a time proportional to the message size. A simple model for message time is

$$ T_{\text{message}} = \alpha + \beta \cdot n, $$

where $\alpha$ is the latency, $\beta$ is the inverse bandwidth, and $n$ is the number of bytes. Latency dominates for small messages and bandwidth dominates for large ones. Despite its simplicity, this model is a useful way to reason about communication costs when designing algorithms.
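
For illustration, take $\alpha = 2\,\mu\text{s}$ and a bandwidth of 10 GB/s, so $\beta = 10^{-10}$ s per byte (illustrative numbers, not measurements of any particular network). An 8 byte message then costs essentially the full 2 μs of latency, while an 80 MB message costs about 8 ms, almost all of it transfer time. The two terms balance at $n = \alpha / \beta = 20$ KB, a rough boundary between latency dominated and bandwidth dominated messages on such a network.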

To achieve good performance in distributed memory programs, minimize the number of messages and maximize the useful work per message. Try to combine small messages, reduce synchronization points, and arrange computations so that communication can overlap with useful work whenever possible.

Coordination between processes is often necessary. Processes must agree on when to send and receive messages, especially when they depend on each other’s data. Many communication operations imply an ordering and may block until matching operations are posted by partner processes. Poorly coordinated communications can easily lead to deadlocks, which will be treated in a later chapter.

Algorithm Design for Distributed Memory

Designing algorithms for distributed memory systems requires you to balance three aspects: concurrency, locality, and communication. Concurrency means that there must be enough independent work to keep all processes busy. Locality means that most computations use data that resides in the same process’s memory. Communication is necessary whenever these two aims cannot be met simultaneously.

A simple way to approach algorithm design is to start from a sequential algorithm and then identify independent units of work that can execute in parallel. You then map these units onto processes and define the data each unit requires. From this mapping you derive a decomposition of the problem domain and the communication that connects neighboring units.

For example, in a time stepping simulation on a grid, each process can update its portion of the grid using only local and neighbor data. The algorithm for each time step can then be organized as: exchange boundary data with neighbors, update all interior points, update boundary points using the new neighbor data, and proceed to the next step. This pattern introduces concurrency through domain decomposition and limits communication to nearest neighbors.
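
This step maps naturally onto nonblocking communication. The sketch below revisits the 1D row decomposition from before; update_interior and update_boundary are hypothetical local routines, not library calls:

```c
#include <mpi.h>

/* Hypothetical local routines: update_interior touches only points
 * that need no halo data; update_boundary uses the received halos. */
void update_interior(double *u, int local_rows, int ny);
void update_boundary(double *u, int local_rows, int ny);

/* Illustrative time step: start the halo exchange, overlap it with
 * interior work, then finish the boundary once the halos arrive. */
void time_step(double *u, int local_rows, int ny,
               int rank, int size, MPI_Comm comm) {
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    MPI_Request reqs[4];

    MPI_Irecv(&u[0],                     ny, MPI_DOUBLE, up,   0, comm, &reqs[0]);
    MPI_Irecv(&u[(local_rows + 1) * ny], ny, MPI_DOUBLE, down, 1, comm, &reqs[1]);
    MPI_Isend(&u[local_rows * ny],       ny, MPI_DOUBLE, down, 0, comm, &reqs[2]);
    MPI_Isend(&u[1 * ny],                ny, MPI_DOUBLE, up,   1, comm, &reqs[3]);

    update_interior(u, local_rows, ny);      /* no halo data needed here */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    update_boundary(u, local_rows, ny);      /* uses the new halo rows */
}
```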

More complex algorithms, such as distributed sparse linear solvers or global graph algorithms, must deal with irregular communication patterns. Here it is common to separate computation into local phases and communication phases. In local phases each process works only with data it owns. In communication phases processes exchange updated values, redistribute data, or combine partial results. The efficiency of such algorithms depends strongly on how frequently these phases occur and how much data is involved.

A recurring idea in distributed algorithms is to trade extra computation for less communication. For instance, some redundant calculations may be performed on several processes to avoid transmitting intermediate values. This is often a good trade, because floating point operations are much cheaper than moving data across the network.

Load Balancing and Scalability

In distributed memory programming, poor load balance can completely dominate performance. Since processes work independently but often must synchronize at key points, the total runtime can be limited by the slowest or most heavily loaded process. If one process has much more data or a more expensive portion of the domain, all others may have to wait for it at synchronization stages, such as during collective communication.

Load balancing is the art of distributing work and data so that all processes have roughly equal amounts. For structured problems, this can be as simple as dividing arrays evenly across processes. For example, if there are $N$ elements to process and $P$ processes, you might assign approximately $N / P$ elements to each. For irregular workloads, more sophisticated partitioning is needed, sometimes using external partitioning tools.
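
The sketch below spells out that even division, including the remainder when $N$ is not a multiple of $P$; counts and displs are our own names, in the form that scatter-type operations commonly expect:

```c
/* Illustrative even partition of n elements over nprocs processes:
 * process p owns counts[p] elements starting at global index displs[p];
 * the first n % nprocs processes take one extra element. */
void partition(int n, int nprocs, int *counts, int *displs) {
    int base   = n / nprocs;
    int rem    = n % nprocs;
    int offset = 0;
    for (int p = 0; p < nprocs; p++) {
        counts[p] = base + (p < rem ? 1 : 0);
        displs[p] = offset;
        offset   += counts[p];
    }
}
```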

Scalability in distributed memory refers to how performance changes as you increase the number of processes. Strong scaling asks how the runtime changes when the problem size stays fixed and you increase the number of processes. Weak scaling asks how the runtime changes when you increase the number of processes and the problem size at the same time, usually so that the work per process stays constant.
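
In symbols, with $T(1)$ the runtime on one process and $T(P)$ the runtime on $P$ processes for the same fixed problem, strong scaling is commonly summarized by the speedup and parallel efficiency

$$ S(P) = \frac{T(1)}{T(P)}, \qquad E(P) = \frac{S(P)}{P}, $$

while weak scaling is usually reported as the efficiency $T(1) / T(P)$ with the problem size grown in proportion to $P$, so that an ideally weakly scaling code keeps $T(P) \approx T(1)$.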

Because communication overheads typically grow with the number of processes, there are limits to scalability. Beyond a certain point, more processes simply add communication cost and idle time without reducing runtime. Good distributed memory codes are designed to delay this point as much as possible, by minimizing communication volume, avoiding unnecessary synchronization, and keeping computation well balanced.

A distributed memory program is only as fast as its most heavily loaded and most communication bound process. Always analyze both load balance and communication patterns when evaluating performance.

Programming Models and Libraries

Distributed memory parallel programming can be expressed using several programming models. The most widely used in HPC is the message passing model, represented by MPI. In message passing, you think explicitly in terms of sending and receiving data between processes, or of participating in collective operations. The MPI chapter introduces the concrete interface and usage patterns.

There are also higher level libraries and frameworks built on top of message passing. Domain specific libraries for linear algebra, fast Fourier transforms, or unstructured meshes encapsulate distributed data structures and communication patterns so that application codes can focus more on local computations and algorithms. These libraries typically hide the low level message passing from the user while still relying on the same distributed memory concepts.

Some programming models provide partitioned global address spaces, where the programmer writes code as if there were a single global memory but can still reason about locality and where data physically resides. Although the details vary, such models still rest on distributed memory foundations and ultimately use communication to move data between physical memories.

Regardless of the specific library or model, the distinguishing characteristic of distributed memory programming is the explicit separation of address spaces and the need to design algorithms around that separation. Once you become comfortable with data decomposition, explicit communication, and load balancing, the techniques transfer across languages, libraries, and HPC systems.

Practical Considerations for Beginners

For beginners, the most challenging aspect of distributed memory programming is usually not the syntax of communication calls, but the change in thinking. It is helpful to adopt a few practical habits.

First, always draw a picture of how your data is split among processes. Label what each process owns and what information it needs from others. This picture often reveals both missing communications and unnecessary ones. Second, be explicit in your code about ownership and communication. Name variables in ways that clarify whether they are local data, halo data, or buffers for messages.

Third, expect that debugging distributed programs requires different techniques from serial or shared memory codes. The behavior of a distributed program depends on the interplay between processes and the network, and errors can manifest only when many processes run together. Logging, careful checks of assumptions, and incremental scaling from few to many processes are valuable strategies.

Finally, always keep performance in mind while preserving correctness. Start with a correct but perhaps naive communication pattern. Once it works, measure its performance and then refine the data decomposition and communication scheme. Distributed memory programming is an iterative process where algorithmic insight and performance analysis go hand in hand.
