What “Shared Memory” Means in HPC
In a shared memory system, multiple processing units (cores, sometimes spread across multiple CPUs) can directly access the same physical memory space. Conceptually, all threads see a single, unified address space:
- Any core can read or write to any memory location (subject to OS protection).
- Data can be shared simply by using pointers or references to the same variables.
- There is no need to explicitly “send” data between compute elements (unlike distributed-memory systems using MPI).
This is a programming model as well as a hardware organization. When we say “shared memory system” in HPC, we usually mean a machine whose hardware and operating system present a single coherent memory address space to all cores.
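As a minimal sketch of this model (assuming a POSIX system; compile with `-pthread`), two threads can fill halves of the same array simply by receiving pointers into it — no data is ever explicitly transferred:

```c
#include <pthread.h>
#include <stdio.h>

#define N 8

static double data[N];  /* one array, visible to every thread */

/* Each thread fills the half of the array it is handed a pointer to. */
static void *fill(void *arg) {
    double *start = (double *)arg;
    for (int i = 0; i < N / 2; i++)
        start[i] = 1.0;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    /* Both threads share the process address space: passing plain
       pointers is all that is needed to "share" data. */
    pthread_create(&t1, NULL, fill, data);
    pthread_create(&t2, NULL, fill, data + N / 2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += data[i];
    printf("sum = %.1f\n", sum);  /* prints 8.0 */
    return 0;
}
```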
Typical shared-memory machines in HPC include:
- Single-socket servers (multiple cores in one CPU).
- Multi-socket nodes (two or more CPUs sharing main memory).
- Large symmetric multiprocessor (SMP) or cache-coherent NUMA (ccNUMA) systems with dozens or hundreds of cores.
Hardware Organization of Shared Memory Systems
Symmetric Multiprocessing (SMP)
SMP systems have:
- Multiple identical CPUs (or multi-core processors).
- A single, shared main memory.
- A uniform view of memory latency (at least conceptually).
Each processor connects to the same memory subsystem, so the time to access any memory location is approximately the same. This is often called Uniform Memory Access (UMA).
SMP characteristics relevant to HPC:
- Simpler reasoning about performance: memory is “flat.”
- Often limited in scalability because a single memory bus or memory controller becomes a bottleneck.
- Common in smaller nodes or older systems.
Non-Uniform Memory Access (NUMA)
Modern multi-socket HPC nodes are usually NUMA systems. Memory is physically divided into regions (sometimes called “memory domains” or “nodes”), each associated with a CPU socket:
- Each socket has its own local memory controllers and DRAM.
- Cores access local memory faster than remote memory attached to another socket.
- All memory is still addressable by all cores — the system is logically shared memory — but the latency and bandwidth depend on where the memory physically resides.
Key points:
- Still a shared address space: a process can allocate memory and any thread on any core can use it.
- The topology of memory matters; “where” memory lives affects performance.
You will often see this described as cache-coherent NUMA (ccNUMA), because the system also maintains cache coherence across sockets (next section).
Cache Coherence in Shared Memory Systems
Each core typically has its own private caches (L1, often L2) and may share higher-level caches (L3). Since data in memory can be cached on multiple cores, the hardware must ensure that:
- All cores see a consistent view of shared data.
- Writes by one core eventually become visible to other cores.
- No core uses stale values indefinitely.
This is handled by cache coherence protocols at the hardware level (e.g., MESI and its variants). While you usually do not program these protocols directly, they have performance implications in HPC:
- False sharing: Different cores update different variables that happen to share the same cache line, causing unnecessary coherence traffic and slowdowns.
- Contention on shared data: Many cores frequently writing to the same cache line create “hot spots” and coherence bottlenecks.
On a functional level, cache coherence is what makes shared memory programming seem simple: you just read and write variables, and all cores eventually agree on their values. On a performance level, it is a key source of overhead when scaling to many cores.
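False sharing in particular is easy to reproduce. The following is an illustrative sketch, not a benchmark; it assumes a 64-byte cache line (typical of current x86 CPUs) and uses `volatile` only to keep the compiler from optimizing the memory traffic away:

```c
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    100000000L

/* Unpadded: the four counters are adjacent, so they typically share
   one cache line and every increment triggers coherence traffic. */
static volatile long counters[NTHREADS];

/* Padded: each counter gets its own (assumed 64-byte) cache line. */
static struct { volatile long value; char pad[64 - sizeof(long)]; } padded[NTHREADS];

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            counters[id]++;          /* all threads fight over one line */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            padded[id].value++;      /* each thread stays on its own line */
    }
    double t2 = omp_get_wtime();

    printf("shared line: %.2fs   padded: %.2fs\n", t1 - t0, t2 - t1);
    return 0;
}
```

Each thread updates only its own counter — logically there is no sharing at all — yet the unpadded version is typically several times slower, purely because of coherence traffic on the shared cache line.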
Types and Scales of Shared Memory Systems
Shared memory systems exist at different scales within HPC infrastructure:
Small-Scale Shared Memory: Single Node
Most HPC cluster nodes are themselves shared memory systems:
- One or more CPU sockets.
- Dozens to hundreds of CPU cores.
- One shared physical memory space.
From a programmer’s perspective:
- A process can spawn multiple threads (e.g., with OpenMP, std::thread, or pthreads) that all share the same memory.
- You use shared-memory parallel programming within the node, while using MPI or other methods between nodes.
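For example, a minimal OpenMP sketch (compile with, e.g., `gcc -fopenmp`) in which all threads of one process update a single shared allocation:

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long n = 1000000;
    double *a = malloc(n * sizeof *a);  /* one allocation, shared by all threads */

    /* Each thread updates a disjoint chunk of the same array;
       no copies, no messages. */
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = 2.0 * i;

    printf("used up to %d threads, a[42] = %.1f\n",
           omp_get_max_threads(), a[42]);
    free(a);
    return 0;
}
```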
Large-Scale Shared Memory: Big SMP Servers
Some HPC centers host large shared-memory machines, sometimes called “fat nodes” or “big iron”:
- Hundreds or even thousands of cores.
- Very large memory (terabytes).
- High-end ccNUMA or special interconnects within the system.
These systems are useful for:
- Memory-bound applications requiring a very large shared memory space.
- Algorithms that are difficult to distribute across multiple address spaces.
However:
- They are expensive and relatively rare compared to cluster nodes.
- Performance tuning is more complex due to NUMA and coherence effects.
Shared Memory in the Context of HPC Clusters
Even though large clusters are primarily distributed-memory systems (multiple nodes connected by a network), each individual node is usually a shared-memory system:
- Within a node: shared memory across cores and sometimes across sockets.
- Between nodes: separate memory spaces; communication via network (e.g., MPI messages).
This leads to:
- Hybrid programming models (e.g., MPI between nodes, OpenMP within nodes), which you will see in other chapters.
- Resource allocation via the scheduler, which typically assigns whole nodes or subsets of cores within nodes; all those cores share the node’s memory.
For this chapter, the key takeaway is that nodes themselves are shared memory systems, and their internal structure influences how you exploit them effectively.
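A minimal hybrid skeleton might look like the following sketch, which assumes a launch configuration of one MPI rank per node with several OpenMP threads per rank:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid skeleton: MPI ranks in separate address spaces communicate
   over the network; OpenMP threads inside each rank share that
   node's memory. */
int main(int argc, char **argv) {
    int provided, rank;
    /* Request a threading level compatible with OpenMP inside ranks. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp single
        printf("rank %d runs %d threads in one shared address space\n",
               rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```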
Advantages of Shared Memory Systems for HPC
Some benefits relevant to HPC workloads:
- Simpler programming model: Shared variables, no explicit message passing to share data among threads.
- Fine-grained parallelism: Easy to share small pieces of data or results without heavy communication overhead.
- Low-latency communication within node: Data sharing through memory is faster than sending messages over a network.
- Good for irregular data structures: Linked lists, trees, and other pointer-based structures are easier to handle than in distributed-memory models.
These properties make shared-memory systems particularly attractive for:
- Node-level parallelization.
- Prototyping and development before scaling out to multiple nodes.
Limitations and Scalability Considerations
Shared memory systems do not scale indefinitely. Main limitations in the HPC context:
Memory Bandwidth and Contention
- All cores share some part of the memory subsystem (buses, controllers, DRAM channels).
- As more cores run memory-intensive workloads, they compete for bandwidth.
- At some point, adding more cores yields diminishing returns.
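You can observe this saturation with a STREAM-style triad loop. The sketch below is illustrative only; run it with increasing `OMP_NUM_THREADS` and the reported bandwidth typically plateaus long before all cores are in use:

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long n = 50L * 1000 * 1000;   /* ~1.2 GB across three arrays */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    for (long i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    /* Memory-bound triad: two loads and one store per element. */
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = b[i] + 3.0 * c[i];
    double t = omp_get_wtime() - t0;

    printf("%d threads: %.2f GB/s\n",
           omp_get_max_threads(), 3.0 * 8.0 * (double)n / t / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```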
Coherence Overhead
- Maintaining cache coherence across many cores and sockets is expensive.
- Communication traffic for coherence (invalidations, updates) increases with core count.
- Algorithms that frequently update shared data structures may scale poorly.
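The cost shows up even in something as simple as a shared counter. In this illustrative sketch, atomically updating one shared variable forces its cache line to bounce between cores, while an OpenMP reduction lets each thread accumulate privately and combine results only once:

```c
#include <omp.h>
#include <stdio.h>

#define ITERS 10000000L

int main(void) {
    long total = 0, total2 = 0;

    /* Contended: every increment atomically updates the same variable,
       so its cache line ping-pongs between cores. */
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < ITERS; i++) {
        #pragma omp atomic
        total++;
    }
    double t1 = omp_get_wtime();

    /* Reduction: each thread accumulates a private copy; the partial
       sums are combined once at the end. */
    #pragma omp parallel for reduction(+:total2)
    for (long i = 0; i < ITERS; i++)
        total2++;
    double t2 = omp_get_wtime();

    printf("atomic: %.3fs   reduction: %.3fs   (%ld == %ld)\n",
           t1 - t0, t2 - t1, total, total2);
    return 0;
}
```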
NUMA Effects
- Accessing remote memory on another socket is slower and may reduce effective bandwidth.
- Ignoring NUMA can lead to unpredictable and poor performance.
- NUMA-aware allocation and thread placement are often necessary to get good performance at scale.
Practical Size Limits
Physical and architectural constraints make it difficult to build a single, coherent shared-memory system beyond a certain size:
- Very large systems become complex and costly.
- Network-based clusters with distributed memory are more scalable in terms of total core count and total memory.
For these reasons, HPC systems are typically clusters of shared-memory nodes, not one monolithic global shared-memory machine.
NUMA-Aware Usage Patterns (Conceptual)
Without going into specific tools or commands, there are general patterns for using NUMA-based shared memory systems effectively:
- Thread affinity: Keep threads on the same cores or sockets so their working sets remain local.
- Memory locality: Allocate memory “close” to the threads that will use it. Many OSes use “first touch” policies: memory is placed near the core that first accesses it (see the sketch at the end of this section).
- Partitioned data structures: Structure data so that each thread mostly works on a region that resides in its local memory.
These techniques aim to:
- Increase the fraction of local memory accesses.
- Reduce cross-socket traffic.
- Minimize coherence overhead.
You do not need to know the exact commands or APIs here; the main idea is that on shared memory systems, where in memory your data lives matters for performance, not just correctness.
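As one concrete illustration of the first-touch pattern, here is a sketch that assumes a first-touch placement policy (as on Linux): initializing an array in parallel, with the same static schedule as the compute loop, places each page near the thread that will later use it.

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* First-touch sketch: pages land on the NUMA node of the core that
   first writes them, so parallel initialization -- with the same loop
   schedule as the compute phase -- keeps each thread's data local. */
int main(void) {
    const long n = 100L * 1000 * 1000;
    double *a = malloc(n * sizeof *a);

    /* Parallel initialization: each thread first-touches its own chunk. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;

    /* Compute phase with the same static schedule: each thread now
       works mostly on memory local to its socket. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] += 1.0;

    printf("a[0] = %.1f\n", a[0]);
    free(a);
    return 0;
}
```

Thread pinning (for example via OpenMP’s OMP_PROC_BIND environment variable) complements this pattern by keeping threads from migrating away from the memory they touched first.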
Typical Use Cases for Shared Memory Systems in HPC
Within an HPC environment, shared-memory systems are particularly suitable for:
- Single-node jobs that:
  - Require a large shared memory space.
  - Have algorithms that are naturally expressed using shared variables and threads.
- Pre- and post-processing around large distributed runs:
  - Data preparation, filtering, and analysis that can fit into one node’s memory.
- Interactive analysis:
  - Using multiple cores on a login or analysis node (where permitted) for faster workflows.
- Hybrid applications:
  - Using threads for intra-node parallelism combined with another model for inter-node communication.
Shared memory is therefore a cornerstone of node-level performance in HPC clusters, even when applications ultimately run across many nodes.
Practical Implications for Beginners
When you log in and run on an HPC cluster:
- Each node you are assigned is a shared-memory system.
- If you use threads (e.g., via OpenMP), they share the address space of your process within a node.
- Performance will depend not just on how many cores you use, but on how memory accesses and sharing patterns interact with the underlying shared-memory hardware (caches, NUMA).
Later chapters on shared-memory programming, hybrid models, and performance optimization will build on this understanding of shared memory systems and their role inside HPC clusters.