
Shared memory systems

Understanding Shared Memory Systems

Shared memory systems are one of the core architectural styles used in HPC clusters. In this chapter, the focus is on what makes a system “shared memory,” how this affects programming and performance, and how these systems fit into a modern cluster.

You have already seen in the parent chapter that HPC clusters combine many nodes. Here we zoom in on a single node, or a tightly coupled group of sockets, where memory is shared among processing elements.

What “Shared Memory” Means

In a shared memory system, multiple processing units such as CPU cores access a single, coherent address space. Every core can, in principle, read and write any memory location using standard load and store instructions, without explicit message passing.

The key idea is a global view of memory. When you write double *a in a C program running on a shared memory system, that pointer refers to the same logical array for all threads in the process. The physical placement of the bytes may differ between memory banks, but conceptually all cores see one common memory.
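
As a minimal sketch of this global view (assuming OpenMP and an illustrative array size), the following C fragment lets every thread write part of the same array through one shared pointer; after the parallel region, the writes of all threads are visible through that pointer.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 8;
    double *a = malloc(n * sizeof *a);   /* one array, visible to every thread */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = tid;                  /* each thread writes its share of the shared array */
    }                                    /* implicit barrier: all writes are now visible */

    for (int i = 0; i < n; i++)
        printf("a[%d] = %g\n", i, a[i]); /* the initial thread reads what the others wrote */

    free(a);
    return 0;
}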

In contrast, in distributed memory systems, each process has its own private memory, and data must be exchanged explicitly with communication libraries such as MPI. Shared memory systems hide this communication behind the hardware and coherence protocol.

Hardware Organization of Shared Memory

Shared memory is not a single implementation style. Several hardware organizations provide a shared address space, each with different performance characteristics.

In simple symmetric multiprocessor (SMP) designs, a small number of CPU sockets connect directly to a single main memory system. All cores have similar latency to any memory location. This type of design is conceptually straightforward, but it does not scale well to very large numbers of cores because the memory interface becomes a bottleneck.

Modern multi-socket servers typically implement a form of nonuniform memory access, often abbreviated NUMA. Each CPU socket has its own local memory controller and attached memory modules. The entire set of memory modules across sockets forms a single address space that all cores can access. However, memory that belongs to a remote socket has higher latency and often lower bandwidth than memory that belongs to the local socket.

Although programmers see a single address space, the cost of accessing different parts of this space varies. This leads to the concept of memory locality within a shared memory system.
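
To make this concrete, here is a small Linux-only sketch (assuming the libnuma development library is available; link with -lnuma) that asks the operating system how many NUMA nodes back the shared address space.

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {          /* -1 means the kernel exposes no NUMA support */
        printf("No NUMA support; the machine behaves like a single memory region.\n");
        return 0;
    }
    /* numa_max_node() returns the highest node number, so the count is one more. */
    printf("NUMA nodes in this shared address space: %d\n", numa_max_node() + 1);
    return 0;
}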

Cache Coherence and Consistency

Shared memory is useful only if cores see a consistent view of data. To achieve this, modern processors implement cache coherence protocols. Each core has private caches that store copies of recently accessed data. When one core writes to a memory location, the hardware must ensure that no other core keeps an outdated cached copy.

Coherence protocols track ownership and sharing states of cache lines. When a core writes to a cache line, other cores’ cached copies are invalidated or updated according to the protocol. This allows the system to maintain the illusion of a single up-to-date value for each memory location even though multiple copies may exist temporarily.

On top of coherence, the architecture defines a memory consistency model, which specifies the rules that govern the order in which memory operations appear to execute. Programmers and compilers rely on this model to reason about parallel code and synchronization.

For most beginners, the practical consequence is that ordinary loads and stores work as expected, as long as you use correct synchronization primitives such as locks and barriers around shared data. Details of the consistency model and low-level atomic operations are typically encapsulated inside threading libraries and higher-level APIs.

In shared memory systems, cache coherence ensures that all cores eventually see a consistent value for each memory location, but correct synchronization is still required to avoid race conditions and undefined behavior.
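
A short OpenMP illustration of this point (variable names are hypothetical): one thread writes a shared variable, and a barrier guarantees that all threads then read the completed, up-to-date value.

#include <omp.h>
#include <stdio.h>

int main(void) {
    double shared_value = 0.0;

    #pragma omp parallel num_threads(4)
    {
        if (omp_get_thread_num() == 0)
            shared_value = 42.0;         /* one thread produces the data */

        #pragma omp barrier              /* writes before the barrier are visible after it */

        printf("thread %d sees %g\n",    /* every thread reads a consistent value */
               omp_get_thread_num(), shared_value);
    }
    return 0;
}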

Uniform vs Nonuniform Memory Access

A fundamental distinction among shared memory systems is between uniform memory access (UMA) and nonuniform memory access (NUMA).

In a UMA system, all cores have approximately the same latency and bandwidth to any memory address. Traditional small SMP servers approximate UMA behavior. Program performance depends primarily on the overall memory bandwidth and cache behavior, not on which core accesses which part of memory.

In a NUMA system, each core has fast access to its local memory region and slower access to memory on other sockets. The access time and bandwidth depend on the distance in the interconnect topology. As the system scales to more sockets and more memory controllers, these differences can become significant.

From the point of view of the programming model, both UMA and NUMA expose a single address space. However, NUMA introduces an additional performance dimension: where the data is located relative to the thread that accesses it.

On NUMA shared memory systems, accessing local memory is faster than accessing remote memory. For good performance, aim for data locality, where each thread mostly accesses data stored in its local NUMA region.

Shared Memory in HPC Nodes

In typical HPC clusters, each compute node is itself a shared memory system, often with multiple sockets and many cores per socket. The cluster then combines these nodes into a larger distributed memory system.

A single HPC node can have dozens or even hundreds of CPU cores, all sharing the node’s main memory. Within this node, you can take advantage of shared memory programming models, while across nodes you must use distributed memory models.

This node-level shared memory system often includes:

Multiple CPU sockets, each with its own local memory banks.

A NUMA interconnect between sockets, which provides a unified address space.

A multi-level cache hierarchy, which interacts with the coherence protocol.

Operating system support for pinning threads to cores and controlling memory placement across NUMA regions (see the sketch after this list).
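
As a small illustration of that last point (Linux-specific, relying on the GNU sched_getcpu() extension), the sketch below prints which core each OpenMP thread is currently running on; affinity settings such as the OMP_PROC_BIND environment variable then determine whether that mapping stays fixed.

#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        /* sched_getcpu() reports the core the calling thread is executing on right now */
        printf("OpenMP thread %d is running on core %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}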

Understanding shared memory on a node is essential before building hybrid applications that combine shared and distributed memory approaches.

Programming Models for Shared Memory

Shared memory systems are naturally suited to threaded programming models, where multiple threads of execution share the same process address space.

In HPC, a common shared memory programming model is OpenMP, which provides compiler directives, runtime routines, and environment variables to create and coordinate threads. Although other models exist, such as POSIX threads and task based runtimes, OpenMP is popular because it allows incremental parallelization of existing serial code.

The key idea is that threads collaborate on a single data set in memory. Work can be divided by partitioning loops, assigning different iterations to different threads, or by decomposing tasks that operate on shared structures.
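
A typical case is a loop over shared arrays. In the sketch below (array names and size are illustrative), the directive partitions the iterations among the threads, and each thread updates its part of the shared data directly.

#include <omp.h>
#include <stdlib.h>

int main(void) {
    int n = 1000000;
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);

    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Iterations are divided among the threads; x and y are shared by all of them. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = y[i] + 3.0 * x[i];

    free(x);
    free(y);
    return 0;
}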

Because all threads can access the same data, shared memory models are often easier to apply to fine-grained parallelism than distributed memory models. However, the freedom to access any data also introduces the risk of unintended interference. Coordination through synchronization primitives is essential.

Synchronization and Shared Data

Shared memory parallelism depends on controlled cooperation among threads that share data. Without proper synchronization, multiple threads may read and write the same memory locations concurrently, which can lead to race conditions and incorrect results.

Typical synchronization mechanisms in shared memory environments include mutexes or locks, condition variables, barriers, and atomic operations. These mechanisms establish ordering constraints and mutual exclusion around critical sections.

For example, if several threads update a global counter, each increment must be protected, either by a lock or by an atomic increment, so that increments are not lost. Similarly, if one thread produces data that others consume, those consumers must wait until the producer signals that the data is ready.
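
Sketched in OpenMP (the counter name is illustrative), the update can be protected with an atomic operation; a critical section or a lock would work as well, at higher cost. Without such protection, concurrent increments would be lost.

#include <omp.h>
#include <stdio.h>

int main(void) {
    long counter = 0;

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        #pragma omp atomic               /* protects the read-modify-write on the shared counter */
        counter++;
    }

    printf("counter = %ld\n", counter);  /* prints 100000 when the update is synchronized */
    return 0;
}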

The cost of synchronization is not only the direct overhead of locks or atomics but also the indirect cost of cache coherence traffic. Frequent writes to shared data that is heavily contended can lead to performance degradation even if the program is functionally correct.

In shared memory programs, correctness requires that all concurrent accesses to shared data be correctly synchronized. Unsynchronized read/write or write/write access to the same location by multiple threads leads to race conditions and undefined behavior.

Memory Contention and Bandwidth Limits

Shared memory systems inherently share hardware resources among multiple cores. While this enables easy data sharing, it also creates potential contention.

All cores share the total memory bandwidth available from the memory controllers. If many cores perform memory intensive operations at once, they may saturate the memory subsystem. Once saturation occurs, adding more threads does not increase performance and can even reduce it because of overhead and contention.

Caches are also shared at some levels. Last-level caches are often shared among cores on a socket. If different threads work on large, independent data sets, they can evict each other’s cache lines, which reduces cache effectiveness.

Furthermore, contention can occur within the coherence protocol itself. When multiple threads write to the same cache line, even to different variables that happen to reside on that line, the hardware must repeatedly transfer ownership of the line among caches. This leads to a phenomenon known as false sharing, which can significantly hurt performance while leaving the program’s numerical results unchanged.
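
The following sketch shows the usual shape of the problem (the thread limit and the 64-byte line size are assumptions): per-thread counters packed next to each other share cache lines and bounce between cores, while padding each counter onto its own line removes the coherence traffic without changing the results.

#include <omp.h>
#include <stdio.h>

#define MAX_THREADS 64                   /* assumes at most 64 threads */

/* Adjacent counters share cache lines, so concurrent updates cause false sharing. */
long plain[MAX_THREADS];

/* Each padded counter occupies a full 64-byte cache line of its own. */
struct padded { long value; char pad[64 - sizeof(long)]; };
struct padded padded_counters[MAX_THREADS];

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < 50000000L; i++)
            plain[tid]++;                 /* the shared cache line bounces between cores */
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < 50000000L; i++)
            padded_counters[tid].value++; /* private cache line, no coherence traffic */
    }
    double t2 = omp_get_wtime();
    printf("false sharing: %.3f s, padded: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}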

Performance-aware programming on shared memory systems involves structuring data and access patterns to reduce unnecessary sharing and contention, and to keep working sets within local caches where possible.

NUMA Awareness and Thread Placement

In NUMA-based shared memory systems, how threads are mapped to cores and how memory is allocated across NUMA regions can strongly influence performance.

If a thread frequently accesses data that resides in a remote NUMA region, it incurs extra latency and consumes bandwidth on the inter-socket interconnect. When many threads behave this way, the interconnect can become a bottleneck.

Operating systems provide mechanisms to control thread affinity, which is the binding of threads to specific cores or sockets, and to control memory placement policies. For instance, a “first touch” policy allocates physical memory in the NUMA region of the core that first writes to the page. By initializing data structures in the same parallel pattern as their use, you can often achieve good locality automatically.
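
A minimal sketch of the first-touch idea (array size and schedule are assumptions): because the pages of a are first written inside a parallel loop with a static schedule, each page tends to end up in the NUMA region of the thread that later processes the same index range.

#include <omp.h>
#include <stdlib.h>

int main(void) {
    long n = 100000000L;
    double *a = malloc(n * sizeof *a);

    /* First touch: each page is placed in the NUMA region of the thread that writes it first. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;

    /* The compute loop uses the same static schedule, so each thread
       mostly touches pages in its local NUMA region. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}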

NUMA-aware programming on shared memory systems aims to align computations, threads, and data so that most memory accesses are local.

On NUMA shared memory systems, bind threads to specific cores or sockets and allocate data close to where it will be used. Poor placement can cause remote memory accesses, higher latency, and reduced effective bandwidth.
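
In OpenMP, for example, binding is usually controlled through the OMP_PROC_BIND and OMP_PLACES environment variables. The short example below (an illustration, not a complete recipe) merely checks at runtime whether a binding policy is in effect.

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Reflects the OMP_PROC_BIND setting or a proc_bind clause. */
    omp_proc_bind_t bind = omp_get_proc_bind();

    if (bind == omp_proc_bind_false)
        printf("Threads are not bound; the OS may migrate them across sockets.\n");
    else
        printf("Threads are bound (policy value %d), e.g. OMP_PROC_BIND=close or spread.\n",
               (int)bind);
    return 0;
}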

Scaling Limits of Shared Memory

Shared memory systems are attractive because of their simple programming model. However, they do not scale arbitrarily.

As the number of cores grows, hardware must maintain coherence across more caches and more NUMA regions. The complexity and cost of coherence protocols increase, and the interconnect linking sockets and memory controllers becomes more heavily loaded.

Memory bandwidth per core often decreases as more cores are added to a socket, because the number of memory channels grows more slowly than the core count. For memory-bound applications, this per-core bandwidth limit can prevent linear speedup, even on a single shared memory node.
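
One simple way to observe this limit (array size and thread counts are illustrative) is to time a memory-bound triad loop with an increasing number of threads; on many nodes the runtime stops improving well before the core count is reached, because the memory channels are already saturated.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long n = 40000000L;                  /* large enough to exceed the caches */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);

    /* Parallel first touch so the pages are spread across the NUMA regions. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    for (int threads = 1; threads <= omp_get_max_threads(); threads *= 2) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for num_threads(threads) schedule(static)
        for (long i = 0; i < n; i++)
            a[i] = b[i] + 3.0 * c[i];    /* memory-bound triad: two loads and one store per iteration */
        double t1 = omp_get_wtime();
        printf("%2d threads: %.3f s\n", threads, t1 - t0);
    }

    free(a); free(b); free(c);
    return 0;
}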

These factors limit how far pure shared memory systems can scale. This is one of the reasons why large HPC systems combine multiple shared memory nodes with distributed memory interconnects, and why hybrid programming models have become common.

Shared Memory within the Cluster Context

Within an HPC cluster, shared memory systems primarily exist at the node level. Each node is a shared memory machine that participates in a larger distributed memory system.

This structure has several implications.

First, the programming model within a node can be different from the model across nodes. Many applications use threads and shared memory on each node, and processes with explicit message passing across nodes.

Second, the performance characteristics are hierarchical. Threads on the same core share some resources, threads on the same socket share caches and local memory, threads on different sockets share inter-socket interconnects, and nodes share network links. Effective use of shared memory requires an understanding of this hierarchy.

Third, debugging and performance tuning often start at the shared memory level. If an application scales poorly on a single node due to memory contention, false sharing, or poor NUMA placement, those issues will persist or worsen when you extend to multiple nodes.

By mastering shared memory behavior at the node level, you build a foundation for efficient programming on entire clusters and for exploiting hybrid parallel approaches that combine the strengths of shared and distributed memory systems.
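
As a closing illustration of such a hybrid approach (assuming MPI and OpenMP are available; compile with an MPI wrapper such as mpicc and launch one process per node), the sketch below has each MPI rank handle communication across nodes while its OpenMP threads share that node’s memory.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;

    /* Request thread support; only the master thread makes MPI calls in this sketch. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared memory inside the node: the OpenMP threads of this process. */
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}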
