Context within HPC Clusters
Distributed memory systems are one of the two fundamental ways to organize memory in parallel computers, the other being shared memory systems. At the cluster level, distributed memory is the dominant model: each compute node has its own local memory that is physically separate from the memory of other nodes. Understanding what this implies is essential before learning how to program such systems with MPI or how to design scalable applications.
What “Distributed Memory” Really Means
In a distributed memory system, each processing element, typically a node or sometimes a process, owns a private portion of memory. No other processor can directly read or write that memory. Access to nonlocal data is only possible through explicit communication over an interconnect such as Ethernet or InfiniBand.
It is useful to think of a distributed memory system as a set of islands. Each island has its own storage and compute capability. If an island needs information that lives on another island, someone must physically carry it across the water. In computing terms, that “someone” is a communication operation that sends data across the network.
The opposite model is shared memory, where multiple processors can directly access the same address space. In distributed memory there is no such single global address space. Instead, each process or node has its own address space and addresses are only meaningful locally.
In a distributed memory system, remote data is never accessed by a simple load or store instruction. All remote access must go through explicit communication across the interconnect.
The Global View vs Local View of Data
From the point of view of an application, it is often natural to think in terms of global data structures, for example a large array, a matrix, or a graph. In a distributed memory system, such global structures do not exist as a single block in memory. Instead, they are partitioned and each part is stored on a different node.
This leads to two complementary views.
The global view is how the algorithm designer imagines the data: for instance, a matrix $A$ of size $N \times N$ that must be factored, or a huge 3D mesh of a physical domain that must be simulated.
The local view is what each process actually stores. Each process holds only its piece of the global matrix or mesh, typically a contiguous subdomain or some other partition. The mapping between the global indices and the local indices is part of the program logic. When a process needs a portion of the data that belongs to another process, it must request it and the owner must send it.
A simple one dimensional example illustrates this. Consider a distributed array $x$ of length $N$ stored across $P$ processes. A common choice is a block distribution where each process $p$ owns a consecutive chunk of size approximately $N/P$. If we label processes from $0$ to $P - 1$, and use zero based indexing for $x$, then elements from index $i_{\text{start}}(p)$ to $i_{\text{end}}(p)$ will be stored on process $p$.
If all chunks have equal size, then
$$
\text{local\_size} = N/P,
$$
and for process $p$,
$$
i_{\text{start}}(p) = p \cdot \text{local\_size}, \quad
i_{\text{end}}(p) = (p+1)\cdot \text{local\_size} - 1.
$$
The code on each process uses its own local indices from $0$ to $\text{local\_size} - 1$ and maps these to global indices when needed. The mapping strategy and the choice of distribution pattern have a large effect on communication volume and performance.
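To make the index arithmetic concrete, here is a minimal C sketch of the block distribution; the helper names block_start and block_size and the values of $N$ and $P$ are purely illustrative, not part of any library. It also covers the common case where $N$ is not divisible by $P$ by giving the first $N \bmod P$ processes one extra element.

```c
#include <stdio.h>

/* Illustrative block distribution of N elements over P processes.
   When N is not divisible by P, the first (N % P) processes receive one
   extra element, so chunk sizes differ by at most one. */
static long block_start(long N, int P, int p) {
    long base = N / P, rem = N % P;
    return p * base + (p < rem ? p : rem);
}

static long block_size(long N, int P, int p) {
    return N / P + (p < (N % P) ? 1 : 0);
}

int main(void) {
    long N = 10;          /* global array length (illustrative) */
    int  P = 4;           /* number of processes (illustrative) */
    for (int p = 0; p < P; ++p) {
        long start = block_start(N, P, p);
        long size  = block_size(N, P, p);
        /* local index i in [0, size) maps to global index start + i */
        printf("process %d owns global indices [%ld, %ld]\n",
               p, start, start + size - 1);
    }
    return 0;
}
```

With $N = 10$ and $P = 4$, processes $0$ and $1$ own three elements each and processes $2$ and $3$ own two, which matches the formulas above when the remainder is spread over the first processes.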
Node-Level vs System-Level Distributed Memory
On a typical HPC cluster, each compute node is itself a shared memory system, often with multiple CPU sockets and many cores per socket that share main memory. The entire cluster, however, forms a distributed memory system across nodes. It is therefore useful to distinguish two levels.
At the node level, processes or threads share access to the same physical memory chips. Shared memory programming models, such as OpenMP, are often used inside a node.
At the system level, nodes do not share memory. Processes that live on different nodes must communicate over the network. Message passing libraries, most commonly MPI, implement this mode of operation. When you run a cluster wide job, you usually combine intra node shared memory with inter node distributed memory, which leads to hybrid programming models that are discussed separately.
In this chapter, “distributed memory” refers primarily to the system level behavior where memory is private to each node or process that participates in computation.
Communication through Message Passing
Since direct remote memory access is not part of the basic hardware abstraction, distributed memory systems rely on message passing. A message is a packet of data sent from one process or node to another. Messages traverse the interconnect and arrive in the memory of the destination.
The basic logical operations are send and receive. A send operation specifies what data to send, to which destination, and often a tag or identifier to distinguish different kinds of messages. A receive operation waits for a matching message, then stores the incoming data in a local buffer.
At a conceptual level, we can describe a simple point to point data transfer in a distributed memory environment as
$$
\text{send}(\text{data}, p_{\text{dest}}), \quad
\text{recv}(\text{buffer}, p_{\text{source}}).
$$
The interplay between send and receive operations, and the detailed semantics such as blocking or nonblocking behavior, are the subject of the message passing programming model.
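As a minimal illustration of these two operations, the following C sketch uses MPI, the message passing library mentioned above and treated in detail in a later chapter, to move a single value from process 0 to process 1. The blocking calls, the tag value, and the payload are illustrative choices only.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal point to point transfer: process 0 sends one double to
   process 1. Blocking MPI_Send / MPI_Recv are used for simplicity;
   nonblocking variants belong to the message passing model proper. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int tag = 42;
    if (size >= 2) {
        if (rank == 0) {
            double payload = 3.14;
            MPI_Send(&payload, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double buffer;
            MPI_Recv(&buffer, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %.2f from process 0\n", buffer);
        }
    }

    MPI_Finalize();
    return 0;
}
```

A typical launch uses at least two processes, for example mpirun -np 2 on most MPI installations.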
Although the ability to send arbitrary messages might seem flexible, every message incurs overhead. The cost of communication depends on latency and bandwidth. Latency is the time to initiate and complete a tiny message, while bandwidth is the amount of data that can be transferred per unit time for large messages. On a distributed memory system, achieving high performance requires structuring communication so that latency costs are amortized and bandwidth is used efficiently.
Memory and Data Distribution Strategies
One of the defining tasks in distributed memory programming is the choice of how to distribute data across processes. The physical memory is already distributed by the hardware, but the application designer chooses the logical mapping between global data structures and local memory segments.
A very common pattern is domain decomposition. In domain decomposition, the global problem domain, for example a spatial region or index set, is partitioned into subdomains, and each subdomain is assigned to a process. Each process stores data associated with its subdomain, works on it mostly independently, and exchanges data across the subdomain boundaries as needed.
Several distribution patterns frequently appear in practice (a short sketch after the list compares how the first three map indices to processes):
Block distribution. The global index range is divided into contiguous chunks. Each process receives one chunk. This keeps communication localized to boundaries and tends to be simple to implement. The array example above uses a block distribution.
Cyclic distribution. Indices are assigned in a round robin fashion to processes. For instance, element $0$ goes to process $0$, element $1$ to process $1$, and so on, wrapping around after process $P - 1$. This can help with load balancing when work per index varies.
Block cyclic distribution. A combination of the two, where contiguous blocks of fixed size are assigned cyclically to processes. This pattern is frequently used in distributed linear algebra libraries to achieve both load balance and good communication properties.
Two dimensional and higher dimensional decompositions. For matrices and multidimensional arrays, processes are laid out conceptually in a grid, and blocks of the array are mapped to this grid. For example, a process grid of size $P_r \times P_c$ can own subblocks of an $N \times N$ matrix, with each process storing an approximately $(N / P_r) \times (N / P_c)$ submatrix. Such two dimensional layouts, usually combined with the block cyclic idea above, are used in many scalable dense linear algebra libraries because they spread data and communication evenly.
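As a rough sketch of how the first three one dimensional patterns assign a global index to an owning process, the fragment below implements the three mapping rules side by side. The block size $b$, the sizes $N$ and $P$, and the function names are illustrative assumptions, not a library interface.

```c
#include <stdio.h>

/* Illustrative owner computations for a one dimensional index space of
   length N distributed over P processes. owner_block assumes equal
   chunks of ceil(N / P); the remainder balanced variant appears in the
   earlier block distribution sketch. */
static int owner_block(long i, long N, int P) {
    long chunk = (N + P - 1) / P;        /* ceiling of N / P */
    return (int)(i / chunk);
}

static int owner_cyclic(long i, int P) {
    return (int)(i % P);                 /* round robin over indices */
}

static int owner_block_cyclic(long i, long b, int P) {
    return (int)((i / b) % P);           /* blocks of size b dealt round robin */
}

int main(void) {
    long N = 16; int P = 4; long b = 2;  /* illustrative sizes */
    for (long i = 0; i < N; ++i)
        printf("index %2ld: block -> %d, cyclic -> %d, block cyclic (b=%ld) -> %d\n",
               i, owner_block(i, N, P), owner_cyclic(i, P),
               b, owner_block_cyclic(i, b, P));
    return 0;
}
```

For example, with $b = 2$ and $P = 4$, indices $0$ and $1$ land on process $0$, indices $2$ and $3$ on process $1$, and index $8$ wraps back to process $0$ under the block cyclic rule.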
The key observation is that distributed memory systems force data distribution to become a first class design choice. A good distribution can reduce communication volume and lead to scalable performance. A poor distribution can create bottlenecks and excessive network traffic.
Locality and Communication Costs
On a single shared memory node, locality often refers to cache and NUMA effects, and the main cost is memory access time compared to CPU speed. In a distributed memory cluster, locality extends across nodes: local memory access is much cheaper than remote data transfer over the network.
A simple performance model captures this cost difference. If a message of size $m$ bytes is sent over the interconnect, its time to transfer can be approximated by
$$
T_{\text{comm}}(m) = \alpha + \beta m,
$$
where $\alpha$ is the latency, the start up cost independent of message size, and $\beta$ is the inverse bandwidth, the time per byte transferred.
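As a rough illustration, suppose $\alpha = 2\ \mu\text{s}$ and the bandwidth is $10$ GB/s, so that $\beta = 0.1$ ns per byte; these numbers are hypothetical but of a realistic order of magnitude. Then
$$
T_{\text{comm}}(1\ \text{KB}) \approx 2\ \mu\text{s} + 0.1\ \mu\text{s} = 2.1\ \mu\text{s},
\qquad
T_{\text{comm}}(10\ \text{MB}) \approx 2\ \mu\text{s} + 1\ \text{ms} \approx 1\ \text{ms},
$$
so small messages are dominated by the latency term and large messages by the bandwidth term.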
By contrast, a local memory access incurs neither $\alpha$ nor $\beta$ from the network; it is limited by memory or cache speeds instead. A remote transfer is typically orders of magnitude more expensive than a local access, so distributed memory algorithms are designed to maximize local computation and minimize communication.
On distributed memory systems, local computation is far cheaper than remote communication. Algorithm design should aim to maximize computation per communicated byte and reduce the number of messages.
This has several practical implications.
First, communication free regions of code are particularly important for scalability. If each process can perform a large portion of its work using only local memory before needing to exchange information, then the relative impact of communication is small.
Second, the number of messages matters as much as the total volume of data. Because of the latency term $\alpha$, many small messages are usually worse than fewer larger messages, so distributed memory codes typically aggregate data into larger buffers where possible; the sketch after this list makes the difference concrete.
Third, algorithmic choices often change in a distributed memory context. For example, algorithms that require frequent global synchronization or all to all communication tend to scale poorly as the number of nodes increases.
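The following sketch evaluates the $T_{\text{comm}}(m) = \alpha + \beta m$ model for the two strategies, using the same illustrative $\alpha$ and $\beta$ as above; the message count and message size are likewise hypothetical, chosen only to expose the latency term.

```c
#include <stdio.h>

/* Compare k small messages against one aggregated message under the
   T(m) = alpha + beta * m model. All parameter values are illustrative. */
int main(void) {
    double alpha = 2e-6;      /* 2 microseconds start up cost */
    double beta  = 1e-10;     /* 0.1 ns per byte, i.e. 10 GB/s */
    long   k     = 1000;      /* number of small messages */
    long   m     = 1024;      /* bytes per small message */

    double t_many = k * (alpha + beta * (double)m);   /* k separate sends */
    double t_one  = alpha + beta * (double)(k * m);   /* one aggregated send */

    printf("%ld messages of %ld bytes: %.3f ms\n", k, m, t_many * 1e3);
    printf("1 message of %ld bytes:    %.3f ms\n", k * m, t_one * 1e3);
    return 0;
}
```

With these particular values, the thousand separate messages take roughly twenty times longer than the single aggregated one, even though the same number of bytes crosses the network.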
Scalability and Distributed Memory
Distributed memory systems make it possible to build large machines composed of many commodity nodes. This hardware structure in turn shapes how software can scale.
If we consider an application that uses $P$ processes, each with its own memory, the total memory available scales approximately with $P$. For problems where memory capacity is the limiting factor, this allows you to attack much larger problems by adding more nodes.
Parallel performance can be sketched with a simple model. Suppose that a problem requires $W$ units of work on a single process. If we distribute the work evenly across $P$ processes and ignore overheads, the ideal time per process is $W / P$. In reality, communication adds a cost that typically grows with $P$, so the parallel execution time looks more like
$$
T_P \approx \frac{W}{P} + C_P,
$$
where $C_P$ is the communication cost that depends on the number of processes and the algorithm.
On a distributed memory system, $C_P$ includes both point to point messages and collective operations that involve many or all processes. When designing distributed algorithms, a central goal is to keep $C_P$ small compared to $W / P$, especially for large $P$.
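A small sketch of this model, assuming a purely hypothetical communication term $C_P = c \log_2 P$ (as might arise from a tree based reduction), shows how communication eventually limits speedup as $P$ grows. The values of $W$ and $c$ are arbitrary units chosen only for illustration.

```c
#include <stdio.h>
#include <math.h>

/* Sketch of the T_P ~ W/P + C_P model with an assumed logarithmic
   communication term C_P = c * log2(P). All values are illustrative. */
int main(void) {
    double W = 1000.0;                    /* total work, arbitrary units */
    double c = 5.0;                       /* cost per level of the reduction tree */
    for (int P = 1; P <= 1024; P *= 2) {
        double C_P = (P > 1) ? c * log2((double)P) : 0.0;
        double T_P = W / P + C_P;
        printf("P = %4d   T_P = %8.2f   speedup = %6.2f\n",
               P, T_P, W / T_P);
    }
    return 0;
}
```

Under these assumptions, speedup grows almost linearly at first, then flattens and eventually degrades once $C_P$ dominates $W / P$.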
This leads to a distinction between algorithms that are communication avoiding and those that are communication intensive. Communication avoiding algorithms reorganize computation to reduce the amount or frequency of data exchange, often by recomputing certain intermediate quantities locally instead of fetching them remotely. On distributed memory systems it is often profitable to spend more floating point operations to save communication.
Fault Isolation and Reliability Aspects
Distributed memory systems have some advantages for reliability and fault isolation. Because each node manages its own memory and processes, a failure in one node does not directly corrupt the memory contents of others. At the hardware level, one node can crash without physically damaging the memory of neighboring nodes.
From the software point of view, however, a failed node in a tightly coupled parallel job often leads to global failure, because the other processes are blocked waiting for messages that will never arrive. Techniques such as checkpointing and restart, discussed elsewhere, are therefore particularly important on large distributed memory machines. They allow the overall job to cope with node failures, even though each node is an independent memory domain.
Another reliability related aspect is that address spaces are disjoint. A stray pointer in a process cannot accidentally overwrite memory in another process or node. This can simplify some aspects of debugging, although distributed programs have many other challenging bugs related to communication patterns and synchronization.
Programming Implications of Distributed Memory
Programming a distributed memory system requires an explicit parallel model that expresses both computation and communication. The most common model is message passing, where each process executes its own program instance, communicates through send and receive operations, and cooperates to solve a common problem.
In this model, each process has
its own rank or identifier,
its own address space, that is, its private set of variables and arrays,
and its own control flow, which may be identical to or differ from that of other processes.
This style of programming is sometimes called Single Program, Multiple Data (SPMD). All processes run the same program binary, but operate on different partitions of the data.
A major shift compared to sequential programming is that the programmer must keep track of data ownership. For each piece of data, you need to know which process owns the valid copy. Only the owner can update it directly. Other processes must request updates or copies as needed.
Arithmetic and other local operations typically look similar to sequential code, but are applied to local subsets of arrays. Communication code surrounds or interleaves with computation, handling halo exchanges, reductions, broadcasts, and other patterns required by the algorithm.
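As an illustration of how communication interleaves with local computation, the following SPMD sketch performs a one dimensional halo exchange with MPI_Sendrecv; the local array size, the ghost cell layout, and the initialization are illustrative assumptions rather than a prescribed structure.

```c
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 8   /* interior points owned by each process (illustrative) */

/* SPMD sketch of a one dimensional halo exchange: each process owns
   NLOCAL interior values plus one ghost cell on each side, and swaps
   boundary values with its left and right neighbors before computing. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[NLOCAL + 2];                 /* u[0] and u[NLOCAL+1] are ghosts */
    for (int i = 1; i <= NLOCAL; ++i)
        u[i] = rank * NLOCAL + i;         /* fill interior with rank dependent values */
    u[0] = u[NLOCAL + 1] = 0.0;           /* values at the ends of the global domain */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my rightmost interior value to the right neighbor and receive
       the left neighbor's rightmost value into my left ghost cell. */
    MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                 &u[0],      1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* And the mirror exchange toward the left. */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Local computation, e.g. a stencil update on the interior points,
       can now proceed using u[0] .. u[NLOCAL+1] without further messages. */
    printf("rank %d ghosts: left = %.1f, right = %.1f\n",
           rank, u[0], u[NLOCAL + 1]);

    MPI_Finalize();
    return 0;
}
```

At the ends of the process row, MPI_PROC_NULL turns the corresponding exchange into a no-op, so the same code runs on every rank, which is the essence of the SPMD style.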
Hybrid approaches combine distributed memory models across nodes with shared memory models inside nodes. In that case, the distributed memory concepts described here still apply, but are only responsible for inter node collaboration.
Typical Use Cases and Application Patterns
Distributed memory systems dominate large scale scientific and engineering computing because they are the only practical way to aggregate enough memory and compute power from many nodes. Certain application patterns map particularly naturally to distributed memory.
Large scale simulations and domain based PDE solvers work well with domain decomposition. Each process owns a spatial subdomain, stores the associated fields in its local memory, and exchanges boundary values with neighboring subdomains at each timestep.
Particle based simulations distribute particles across processes. Each process stores the attributes of its subset of particles, computes interactions with nearby particles, and migrates particles between processes as they move across domain boundaries.
Distributed linear algebra libraries partition matrices and vectors across processes. Each process stores blocks or tiles of the global objects. Operations such as matrix multiplication, factorization, or eigenvalue computation are built on carefully designed communication patterns that respect the underlying distribution.
Graph and network analytics distribute vertices and edges across processes. Each process maintains the portion of the graph assigned to it and exchanges frontier or boundary information during traversals or iterative algorithms.
All these examples share the same key pattern. Data is partitioned across private memory spaces. Computation is done locally where the data resides. Communication is used to maintain the global logical structure, for example consistent arrays, fields, or graphs, across the distributed memory of the system.
Summary
Distributed memory systems provide a model for parallel computing where each node or process has its own private memory, and collaboration occurs through explicit communication. They are the foundation of modern HPC clusters because they allow total memory and compute capability to scale with the number of nodes.
Programming such systems revolves around data distribution, message passing, and careful attention to communication costs and locality. These properties influence algorithm design, scalability, and how you structure code. The concepts introduced here underpin later chapters on message passing interfaces, parallel algorithms, and hybrid programming on real clusters.