What the Memory Hierarchy Is About
The memory hierarchy is the layered structure of storage components in a computer system, arranged by speed, cost, and capacity. In HPC, understanding this hierarchy is crucial because:
- It largely determines how fast your code can actually run.
- Poor use of the hierarchy can make a fast CPU sit idle waiting for data.
- Many performance optimizations are really “memory hierarchy optimizations.”
At a high level, you have:
- Very small, very fast storage close to the CPU (registers, caches).
- Larger, slower main memory (RAM).
- Even larger, much slower storage (SSDs, disks), typically discussed elsewhere.
This chapter focuses on how these layers relate to each other and why this matters for HPC performance, leaving detailed internals of each specific level to their own subsections.
Key Trade-Offs: Speed, Capacity, Cost, and Distance
The memory hierarchy exists because no single kind of memory can simultaneously be:
- As fast as CPU logic,
- As large as we’d like,
- As cheap as bulk storage,
- As energy-efficient as we need.
Designers therefore combine multiple levels with different characteristics:
- Closer to the CPU: smaller, faster, more expensive per byte.
- Farther from the CPU: larger, slower, cheaper per byte.
Conceptually:
- Latency: time to access one piece of data.
- Bandwidth: how much data per second can be moved.
- Capacity: how much data can be stored.
- Energy: how much power it takes to keep and move data.
As you move away from the CPU:
- Latency ↑ (slower)
- Bandwidth ↓ (lower)
- Capacity ↑ (bigger)
- Cost per byte ↓ (cheaper)
Typical Memory Hierarchy Levels
A simplified CPU-centric view looks like:
- Registers (closest, fastest, smallest)
- L1 cache (small, very fast)
- L2 cache (bigger, slightly slower)
- L3 cache (larger cache, shared among the cores on many CPUs)
- Main memory (RAM) (much bigger, slower)
- Non-volatile storage (SSD/HDD; typically considered in I/O chapters)
You’ll see some variations, but this pattern—multiple cache levels plus RAM—is standard in modern HPC systems.
The Memory Wall
For decades, CPU compute performance has improved faster than memory latency and bandwidth. This growing gap is known as the memory wall:
- Computation is fast.
- Data movement from memory is comparatively slow.
- Performance becomes memory-bound rather than compute-bound.
In practice this means:
- A CPU can execute multiple operations per cycle, but often waits on data.
- Achieved FLOP/s (floating-point operations per second) may be limited by memory bandwidth/latency rather than by the theoretical peak.
For HPC, managing where your data lives and how it moves through the hierarchy can matter more than the number of cores or clock speed alone.
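As a rough back-of-the-envelope illustration (the figures below are assumptions for the example, not a specific CPU): a core sustaining 16 double-precision FLOPs per cycle at 3 GHz delivers 48 GFLOP/s. A simple update such as `y[i] = y[i] + a*x[i]` performs 2 FLOPs while moving 24 bytes (two 8-byte loads and one 8-byte store), so feeding that core at full speed would require
$$
\frac{24\ \text{bytes}}{2\ \text{FLOP}} \times 48\ \text{GFLOP/s} = 576\ \text{GB/s},
$$
far more than the tens to low hundreds of GB/s a typical socket's memory system can deliver.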
Temporal and Spatial Locality
The hierarchy works well because many programs exhibit locality:
- Temporal locality: If you use data now, you’re likely to use it again soon.
- Spatial locality: If you use data at one address, you’re likely to use nearby addresses soon.
Caches and other parts of the hierarchy are designed to exploit this:
- Data is moved in blocks (cache lines) rather than single bytes.
- Recently accessed data is kept close to the CPU in case it is reused.
- Sequential access patterns are rewarded with efficient transfers.
For HPC performance:
- Accessing arrays in contiguous order (e.g. `for (i = 0; i < N; ++i)`) tends to be faster than random access; see the sketch below.
- Reusing data soon after loading it (e.g. blocking/tiling algorithms) makes better use of the cache.
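A minimal sketch of spatial locality in C (the array size is arbitrary; C stores 2D arrays row-major):

```c
#include <stddef.h>

#define N 4096

/* Row-by-row traversal: consecutive iterations touch adjacent memory,
 * so every byte of each loaded cache line is used (good spatial locality). */
double sum_row_order(const double a[N][N])
{
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            sum += a[i][j];
    return sum;
}

/* Column-by-column traversal of the same row-major array: consecutive
 * iterations are N*8 bytes apart, so cache lines are used one element at a
 * time and, for large N, are often evicted before their other elements are
 * needed (poor spatial locality). */
double sum_column_order(const double a[N][N])
{
    double sum = 0.0;
    for (size_t j = 0; j < N; ++j)
        for (size_t i = 0; i < N; ++i)
            sum += a[i][j];
    return sum;
}
```

Both functions perform exactly the same arithmetic, but on most systems the column-order version is noticeably slower once the array no longer fits in cache.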
Latency vs Bandwidth
Two distinct performance aspects of memory:
- Latency: How long until the first byte of a request arrives.
- Bandwidth: Once transfer starts, how many bytes per second can you sustain.
Different levels in the hierarchy balance these differently:
- Caches: optimized for low latency for small accesses.
- RAM: higher latency than caches, but relatively high bandwidth for large transfers.
- Storage: very high latency (especially disks) and relatively low bandwidth per process.
HPC codes can be:
- Latency-sensitive: Many small, irregular accesses (e.g. sparse data structures).
- Bandwidth-sensitive: Large, streaming accesses (e.g. reading long vectors).
Optimizations often target whichever is the main bottleneck.
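A hedged C sketch of the two regimes (the data structures are illustrative; actual behavior depends on the hardware prefetchers and memory system):

```c
#include <stddef.h>

/* Latency-sensitive pattern: each load depends on the previous one
 * (pointer chasing), so the core pays roughly one full memory latency
 * per element and the available bandwidth is barely used. */
struct node {
    struct node *next;
    double value;
};

double sum_list(const struct node *head)
{
    double sum = 0.0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

/* Bandwidth-sensitive pattern: independent, contiguous loads that the
 * hardware can prefetch and overlap, so throughput is limited by how many
 * bytes per second the memory system can stream. */
double sum_array(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}
```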
Effective vs Theoretical Bandwidth
Theoretical peak bandwidth is given by:
$$
\text{Peak bandwidth} = \text{Bus width (bytes per transfer)} \times \text{Transfer rate (transfers/s)} \times \text{Number of channels}
$$
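For example, assuming DDR4-3200 memory (8 bytes per transfer per channel, 3.2 × 10⁹ transfers/s) and two channels, the peak is
$$
8\ \text{bytes} \times 3.2 \times 10^{9}\ \text{transfers/s} \times 2 = 51.2\ \text{GB/s}.
$$
(The DIMM type and channel count here are chosen only for illustration.)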
In practice, your effective bandwidth is lower due to:
- Protocol overheads.
- Non-optimal access patterns (e.g. small, scattered loads).
- Contention from other cores and processes.
HPC benchmarks (e.g. STREAM) measure sustainable memory bandwidth and give a more realistic picture of what your code can achieve.
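A minimal, STREAM-like sketch of measuring effective bandwidth (this is not the official STREAM benchmark; it assumes a POSIX system for clock_gettime and simply divides bytes moved by elapsed time):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M doubles (~128 MiB) per array, well beyond cache sizes */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Triad-style kernel: reads b[i] and c[i], writes a[i] -> 24 bytes/element. */
    const double s = 3.0;
    for (size_t i = 0; i < N; ++i)
        a[i] = b[i] + s * c[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gb   = 3.0 * N * sizeof(double) / 1e9;
    printf("Effective bandwidth: %.1f GB/s (check: a[0] = %.1f)\n", gb / secs, a[0]);

    free(a); free(b); free(c);
    return 0;
}
```

The number it prints is typically well below the theoretical peak from the formula above; that gap is exactly the difference between effective and theoretical bandwidth. (Stores may also incur extra read-for-ownership traffic, which this simple byte accounting ignores.)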
Multi-Level Caches and Sharing
Modern CPUs often have:
- Private L1 and L2 caches per core.
- A shared L3 cache among multiple cores on a socket.
Implications for HPC:
- Data sharing between threads on the same CPU can benefit from the shared cache.
- False sharing (threads updating different data that happens to share a cache line) can cause unnecessary traffic within the hierarchy.
- NUMA effects appear when multiple memory controllers and sockets are present (covered in more detail in other chapters).
Knowing which data is truly shared and how it’s laid out in memory affects how efficiently the hierarchy and sharing mechanisms work.
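A hedged OpenMP sketch of false sharing and one common fix (the 64-byte cache-line size, thread count, and use of C11 _Alignas are assumptions; compile with an OpenMP flag such as -fopenmp):

```c
#include <omp.h>

#define NTHREADS   8
#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Packed layout: adjacent counters share cache lines, so an update by one
 * thread invalidates the line in every other thread's cache (false sharing). */
static long counters_packed[NTHREADS];

/* Padded layout: C11 _Alignas gives each counter its own cache line, so
 * threads update independent lines (at the cost of some wasted memory). */
struct padded_counter { _Alignas(CACHE_LINE) long value; };
static struct padded_counter counters_padded[NTHREADS];

void count_events(long iterations)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < iterations; ++i) {
            counters_packed[t]++;         /* cache line ping-pongs between cores */
            counters_padded[t].value++;   /* each thread stays in its own line   */
        }
    }
}
```

The two arrays hold the same logical data; only the memory layout differs, which is typical of memory-hierarchy fixes.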
Bandwidth vs Core Count
As core counts per socket grow, per-core memory bandwidth drops unless the socket's total memory bandwidth grows proportionally:
- A socket with 2 memory channels and 4 cores might give each core plenty of bandwidth.
- The same socket design with 32 cores now provides much less bandwidth per core.
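To make this concrete (assuming roughly 25.6 GB/s per channel, e.g. DDR4-3200):
$$
\frac{2 \times 25.6\ \text{GB/s}}{4\ \text{cores}} = 12.8\ \text{GB/s per core}
\qquad\text{vs.}\qquad
\frac{2 \times 25.6\ \text{GB/s}}{32\ \text{cores}} = 1.6\ \text{GB/s per core}.
$$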
Result:
- Adding more threads or processes on a node may not increase performance if memory bandwidth is already saturated.
- Many HPC applications show good speedup up to some number of cores per node, then flatten out or even slow down.
This is a direct consequence of how multiple cores share the memory hierarchy.
Implications for Parallel HPC Codes
The memory hierarchy affects parallel performance at multiple levels:
- On a single core: register and cache usage determines how many operations per cycle you can sustain.
- Within a node: cores share part of the hierarchy and main memory bandwidth; bad access patterns can cause contention.
- Across nodes: each node has its own memory; communication between nodes is constrained by network bandwidth/latency, separate from the local hierarchy.
Typical high-level implications:
- Algorithms that maximize data reuse (do more work per byte loaded) run faster.
- Algorithms that minimize random access and favor contiguous access use caches and RAM more efficiently.
- The “best” parallel decomposition of a problem often depends on memory hierarchy, not just compute capacity.
Roofline Perspective
A common way to relate computation and memory hierarchy is the roofline model, which plots:
- Achievable performance (FLOP/s) on the vertical axis, against
- Operational intensity (FLOPs performed per byte moved from memory) on the horizontal axis.
Given a memory bandwidth $B$ (bytes/s) and peak compute $P$ (FLOP/s):
- If an algorithm has operational intensity $I$ (FLOPs/byte), the memory-bound performance limit is:
$$
P_{\text{mem}} = B \times I
$$
- Achievable performance is approximately:
$$
P_{\text{achievable}} = \min(P,\, B \times I)
$$
This shows directly how the memory hierarchy (through $B$) sets a ceiling on performance for memory-intensive codes.
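As a worked example (numbers chosen for illustration): a triad-style kernel `a[i] = b[i] + s*c[i]` performs 2 FLOPs per element while moving 24 bytes, so $I = 2/24 \approx 0.083$ FLOP/byte. With $B = 200$ GB/s and $P = 2$ TFLOP/s:
$$
P_{\text{achievable}} = \min\bigl(2000,\ 200 \times 0.083\bigr)\ \text{GFLOP/s} \approx 16.7\ \text{GFLOP/s},
$$
i.e. less than 1% of peak compute: at such low operational intensity, the memory term $B \times I$ is the binding limit.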
Practical Takeaways for Beginners
At this stage, you don’t need to memorize hardware details, but keep these ideas in mind:
- Data closer to the CPU is faster to access; the farther away it lives, the slower the access.
- Access patterns matter: contiguous and reused data is “cheap,” random and one-time-use data is “expensive.”
- Using more cores does not always mean faster if memory bandwidth becomes the bottleneck.
- Many HPC “tricks” (blocking, tiling, loop reordering, structure-of-arrays vs array-of-structures, etc.) are about using the memory hierarchy better; a small layout sketch follows below.
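As one small example of such a trick, here is a hedged sketch of array-of-structures vs structure-of-arrays layout (the particle fields are invented for illustration):

```c
#include <stddef.h>

/* Array of structures (AoS): a position update still pulls the unused mass
 * field (and any other per-particle data) into the cache with each particle,
 * wasting bandwidth; the waste grows with the number of unused fields. */
struct particle_aos { double x, y, z, vx, vy, vz, mass; };

void push_aos(struct particle_aos *p, size_t n, double dt)
{
    for (size_t i = 0; i < n; ++i) {
        p[i].x += dt * p[i].vx;
        p[i].y += dt * p[i].vy;
        p[i].z += dt * p[i].vz;
    }
}

/* Structure of arrays (SoA): each field is a contiguous array, so the update
 * streams only the bytes it needs and vectorizes more readily. */
struct particles_soa {
    double *x, *y, *z, *vx, *vy, *vz, *mass;
    size_t  n;
};

void push_soa(struct particles_soa *p, double dt)
{
    for (size_t i = 0; i < p->n; ++i) {
        p->x[i] += dt * p->vx[i];
        p->y[i] += dt * p->vy[i];
        p->z[i] += dt * p->vz[i];
    }
}
```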
Following chapters on registers, caches, and main memory will look at specific levels in more detail and connect them to concrete programming practices.