
Distributed memory systems

Concept of Distributed Memory in HPC

In a distributed memory system, each compute node has its own private memory (RAM). No other node can directly read or write that memory. All data exchange between nodes happens explicitly via a network (e.g., Ethernet, InfiniBand).

This is in contrast to shared-memory systems, where multiple processing units see a single, unified memory address space.

Key characteristics:

- Memory is physically partitioned across nodes; each node can directly access only its own RAM.
- Remote data must be transferred explicitly over the interconnect (message passing).
- Aggregate memory capacity and compute capability grow as nodes are added.

The cluster as a whole forms a distributed memory machine, even though each node might itself be a shared-memory system internally.

Address Space and Data Locality

In distributed memory systems, each process typically has its own address space on a specific node. An address in one process’s memory has no meaning to another process on another node.

Implications:

- Data needed by another process must be sent to it explicitly; it cannot simply be read in place.
- Global data structures must be partitioned, with each process holding only its own portion.
- Pointers and indices are only meaningful locally; "global" indices have to be translated to local ones by the application.

A common mental model is: “One process per node (or per core) with its own private memory; processes collaborate by exchanging messages.”
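
As a small illustration of this model (assuming an MPI installation and a C compiler; mpicc and mpirun are the typical build and launch tools, but details vary by site), the sketch below shows that every process has its own copy of a variable at its own address, and that one process's addresses mean nothing to another:

```c
/* Illustrative sketch: every MPI process has its own private copy of x.
 * The address printed by one rank cannot be dereferenced by any other rank;
 * data only moves between ranks as explicit messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int x = rank;   /* a separate x exists in each process's private memory */
    printf("rank %d: x lives at address %p and holds %d\n", rank, (void *)&x, x);

    MPI_Finalize();
    return 0;
}
```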

Programming Models for Distributed Memory

The dominant programming model for distributed memory in HPC is message passing:

- Each process runs in its own address space and exchanges data by explicitly sending and receiving messages.
- The de facto standard interface is MPI (Message Passing Interface), which provides point-to-point operations (send/receive), collective operations (broadcast, reduction, gather/scatter), and more.

While MPI can also be used on shared-memory machines, it is especially natural for distributed memory, since it mirrors the underlying hardware structure: independent processes + explicit messages over a network.
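
A minimal sketch of this model is shown below (illustrative only; it assumes at least two MPI processes are launched): rank 0 sends a small array to rank 1 with explicit point-to-point calls.

```c
/* Minimal point-to-point message passing: rank 0 sends an array to rank 1.
 * Typical build/run (site-specific): mpicc send.c -o send && mpirun -np 2 ./send */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double buf[4] = {0};

    if (size >= 2) {
        if (rank == 0) {
            for (int i = 0; i < 4; i++) buf[i] = i + 0.5;   /* data exists only in rank 0's memory */
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received: %.1f %.1f %.1f %.1f\n", buf[0], buf[1], buf[2], buf[3]);
        }
    }

    MPI_Finalize();
    return 0;
}
```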

Other (less common) models that can target distributed memory systems:

- PGAS (partitioned global address space) approaches such as Coarray Fortran, UPC, or Chapel, which present a global address space while keeping data locality explicit.
- Task-based and data-parallel frameworks that distribute work and data across nodes on behalf of the programmer.

In practical cluster usage, most production distributed-memory applications rely primarily on MPI or MPI combined with node-level parallel models (e.g., MPI + OpenMP).
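
One hedged sketch of that hybrid pattern is shown below (it assumes both an MPI library and OpenMP support; the work split and problem size are purely illustrative). MPI splits the work across ranks, while OpenMP threads share each rank's memory within a node:

```c
/* Hybrid sketch: MPI between nodes, OpenMP threads within a node.
 * Typically launched with one (or a few) ranks per node and OMP_NUM_THREADS
 * set to the cores available to each rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* Request a thread support level suitable for OpenMP regions between MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Distributed-memory level: split the iteration range over MPI ranks. */
    const long N = 1000000;                      /* illustrative global work size */
    long begin = rank * N / size;
    long end   = (rank + 1) * N / size;

    /* Shared-memory level: OpenMP threads share this rank's memory. */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = begin; i < end; i++)
        local_sum += 1.0 / (1.0 + (double)i);

    /* Combine per-rank partial sums with a message-passing collective. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, global sum = %.6f\n",
               size, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```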

Data Decomposition and Domain Partitioning

Because memory is not shared across nodes, large problems must be partitioned:

- The global problem (grid, mesh, matrix, particle set, ...) is decomposed into pieces, one per process.
- Each process stores and updates only its own piece, plus small copies of neighbouring boundary data (ghost or halo cells) where needed.

Partitioning goals:

- Balance the work (and memory) evenly across processes.
- Minimize the amount of data that must be communicated across partition boundaries.
- Keep communication patterns simple and local where possible (e.g., nearest-neighbour exchanges).

These decompositions are explicitly designed with the underlying distributed memory in mind; good decompositions are essential for performance and scalability.
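
For illustration only, a simple 1-D block decomposition might assign each rank a contiguous range of a global array of N elements (N and the array contents below are placeholders):

```c
/* Sketch of a 1-D block decomposition: N elements split as evenly as possible
 * over the ranks; lower-numbered ranks absorb the remainder. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 1000;                          /* global problem size (illustrative) */
    long base = N / size, rem = N % size;
    long local_n = base + (rank < rem ? 1 : 0);   /* how many elements this rank owns */
    long first   = rank * base + (rank < rem ? rank : rem);   /* first global index owned */

    /* Each rank allocates and initializes only its own piece of the "global" array. */
    double *u = malloc(local_n * sizeof(double));
    for (long i = 0; i < local_n; i++)
        u[i] = (double)(first + i);               /* global index first+i lives here */

    printf("rank %d owns global indices [%ld, %ld)\n", rank, first, first + local_n);

    free(u);
    MPI_Finalize();
    return 0;
}
```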

Communication Patterns and Overheads

Because communication is explicit and relatively expensive compared to local memory access, understanding communication patterns is central to distributed memory systems.

Common patterns:

- Point-to-point exchanges between neighbouring processes (e.g., halo/ghost-cell exchanges in grid-based codes).
- Collective operations involving many or all processes: broadcast, reduction, gather/scatter, all-to-all.
- Master-worker or task-distribution schemes, where work items and results are passed as messages.

Overhead considerations:

- Every message pays a latency cost regardless of size, so many small messages are expensive.
- Large data volumes are limited by network bandwidth.
- Synchronization (waiting for partners or for collectives to complete) can leave processes idle.
- Network contention grows as more processes communicate at the same time.

In distributed memory applications, algorithms are often redesigned to:

- reduce the number and total volume of messages,
- aggregate many small messages into fewer larger ones,
- overlap communication with computation where possible, and
- favour local, nearest-neighbour communication over global exchanges.
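
The sketch below illustrates the last two points, assuming a 1-D grid code with one ghost cell per side (LOCAL_N and the initial values are placeholders): nonblocking messages are posted first, interior work proceeds while they are in flight, and the ghost cells are used only after the messages complete.

```c
/* Sketch of a 1-D halo (ghost-cell) exchange with nonblocking messages,
 * overlapping boundary communication with interior computation. */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 8   /* interior cells per rank (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[LOCAL_N+1] are ghost cells holding copies of the neighbours' boundaries. */
    double u[LOCAL_N + 2];
    for (int i = 0; i <= LOCAL_N + 1; i++) u[i] = rank;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request reqs[4];
    /* Post receives and sends for the boundary values ... */
    MPI_Irecv(&u[0],           1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&u[1],           1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&u[LOCAL_N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    /* ... and do interior work that needs no remote data while messages are in flight. */
    double interior_sum = 0.0;
    for (int i = 2; i < LOCAL_N; i++) interior_sum += u[i];

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);   /* ghost cells are now valid */
    printf("rank %d: interior sum = %.1f, left ghost = %.1f, right ghost = %.1f\n",
           rank, interior_sum, u[0], u[LOCAL_N + 1]);

    MPI_Finalize();
    return 0;
}
```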

Scalability in Distributed Memory Systems

Distributed memory systems are the main path to scaling to very large core counts and problem sizes, because memory and compute are both scaled by adding more nodes.

Two scaling aspects:

- Strong scaling: the total problem size stays fixed while more processes are added; ideally the runtime drops proportionally.
- Weak scaling: the problem size grows with the number of processes so the work per process stays constant; ideally the runtime stays the same.
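
These are commonly quantified as speedup S(P) = T(1) / T(P) and parallel efficiency E(P) = S(P) / P for strong scaling, where T(P) is the runtime on P processes; for weak scaling, the ratio T(1) / T(P) with fixed work per process is reported as the weak-scaling efficiency.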

Scalability challenges specific to distributed memory:

- Communication costs grow relative to computation as each process's share of the problem shrinks.
- Global collectives and synchronization become more expensive at large process counts.
- Load imbalance leaves some processes waiting for others.
- Network topology and contention start to matter at scale.

Designing algorithms for distributed memory often focuses on minimizing or restructuring communication and using more scalable collective operations.
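
As one example of a scalable collective, the sketch below computes a global dot product with MPI_Allreduce instead of sending every partial result to a single rank (vector contents and sizes are placeholders):

```c
/* Sketch: global dot product via a collective reduction.
 * MPI_Allreduce is typically implemented with tree- or butterfly-like algorithms,
 * which scale far better than every rank messaging rank 0 individually. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int local_n = 1000;                 /* length of this rank's slice (illustrative) */
    double local_dot = 0.0;
    for (int i = 0; i < local_n; i++) {
        double x = 1.0, y = 2.0;              /* stand-ins for this rank's vector entries */
        local_dot += x * y;
    }

    double global_dot = 0.0;
    MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global dot product over %d ranks = %f\n", size, global_dot);

    MPI_Finalize();
    return 0;
}
```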

Memory Capacity and Problem Size

One key advantage of distributed memory systems is the ability to aggregate memory across many nodes.

However, because memory is distributed:

- No single process can hold or directly see the entire data set; the "global" arrays exist only as the union of all local pieces.
- Data structures, I/O, and post-processing must all be designed around this partitioning.
- Adding nodes increases total memory, but only if the application actually distributes its data.

This is a fundamental mindset shift from single-node or shared-memory programming.
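
For example (illustrative numbers only): a double-precision field of 2000 × 2000 × 2000 points occupies 8 × 10^9 values × 8 bytes = 64 GB, more than many individual nodes provide, yet distributed over 64 nodes it needs only about 1 GB per node, plus ghost cells and communication buffers.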

Fault Tolerance and Reliability Considerations

As the number of nodes grows, the probability that something fails during a long run increases. Distributed memory systems must handle:

- node crashes or hardware faults (memory, disks, power),
- network errors or degraded links, and
- job interruptions (e.g., hitting wall-time limits or scheduled maintenance).

Typical strategies (at the application and system level):

- Checkpoint/restart: periodically save the application state to persistent storage so a failed run can resume from the last checkpoint rather than from the beginning.
- Automatically restarting or resubmitting failed jobs through the batch system.
- Redundancy and monitoring at the system level (spare nodes, failover services).

Distributed memory architecture influences how checkpointing is implemented, since the global state is physically spread across many nodes and must be saved in a coordinated way.
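
A minimal sketch of such a coordinated checkpoint, assuming a block-distributed array and using MPI-IO so that all ranks write their piece into one shared file (the file name, block size, and data are placeholders):

```c
/* Sketch: coordinated checkpoint of a block-distributed array with MPI-IO.
 * Every rank writes its local block at its global offset in one shared file. */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 1024   /* local elements per rank (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local[LOCAL_N];
    for (int i = 0; i < LOCAL_N; i++) local[i] = rank + 0.001 * i;   /* this rank's state */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's block starts at rank * LOCAL_N elements into the global file. */
    MPI_Offset offset = (MPI_Offset)rank * LOCAL_N * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, LOCAL_N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);

    if (rank == 0) printf("checkpoint written by all ranks\n");

    MPI_Finalize();
    return 0;
}
```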

When to Use Distributed Memory Systems

Distributed memory systems (i.e., HPC clusters with many nodes) are particularly suitable when:

- the problem does not fit into the memory of a single node,
- a single node's compute capability is insufficient to reach acceptable runtimes, or
- the workload consists of many loosely coupled tasks that can be spread across nodes.

Typical application areas:

- computational fluid dynamics and other mesh-based simulations,
- weather, climate, and ocean modelling,
- molecular dynamics and materials science,
- large-scale linear algebra and other simulation or data-processing workloads that exceed a single node.

Understanding the properties and constraints of distributed memory systems is foundational for designing scalable HPC codes and for making effective use of modern clusters.
