
Memory hierarchy

Overview of the Memory Hierarchy

High performance computing depends very strongly on how fast data can move, not only on how fast a processor can execute instructions. The memory hierarchy is the structured arrangement of different kinds of storage close to the CPU, each with its own speed, capacity, and cost.

At the top of this hierarchy the CPU can access data in a few cycles. Farther down, it may take hundreds of cycles or more. HPC performance is often limited by these delays, not by raw computation. Understanding this hierarchy helps you write programs that keep data where it can be accessed quickly, and avoid frequent slow transfers.

In this chapter, you will see the hierarchy as a whole and learn the main trade-offs. Later chapters will focus on individual levels such as registers, caches, and RAM in more detail.

The central rule of the memory hierarchy is:
Faster memory is smaller and more expensive; slower memory is larger and cheaper.
The goal in HPC is to maximize the fraction of time the CPU reads and writes data from the fastest levels of the hierarchy.

Typical Structure of the Memory Hierarchy

A modern HPC node usually includes several levels of storage between the CPU and long term disk storage. In a simplified view, from fastest and smallest to slowest and largest, you see:

  1. CPU registers
  2. Level 1 (L1) cache
  3. Level 2 (L2) cache
  4. Level 3 (L3) cache, sometimes called the last level cache
  5. Main memory, or RAM
  6. Local storage such as SSDs and hard disks
  7. Remote or network storage such as shared parallel filesystems

In this chapter we focus on the first part of this stack up to main memory, because these levels most directly affect computational performance inside a node. Storage systems and parallel filesystems are covered elsewhere.

In typical systems, each lower level in this stack is at least an order of magnitude larger, but also markedly slower, than the level above it. This creates a series of trade-offs that compilers and programmers must navigate.

Latency, Bandwidth, and Capacity

There are three key properties that characterize each level in the hierarchy.

Memory latency is the time it takes from when the CPU requests a piece of data until the first byte of that data is available. Latency is often measured in CPU cycles or in nanoseconds. Small memories like registers and L1 caches have very low latency. Main memory and beyond have much higher latency.

Memory bandwidth is the rate at which data can be transferred once a transfer is in progress. It is usually measured in gigabytes per second or similar units. A level with high bandwidth can deliver large amounts of data per unit time, even if the latency to the first byte is nontrivial.

Capacity is the total amount of data that can be stored at a given level. Capacity increases significantly as you move from registers out to main memory and then to disks and remote storage.

These properties are in tension. A very large memory with extremely low latency and very high bandwidth would be ideal, but is not practical given physical and economic constraints. Instead, hardware designers build a hierarchy where each level provides a compromise among these three properties, and where different levels are optimized for different typical access patterns.
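
As a rough illustration of sustained bandwidth, the following minimal C sketch streams once through a large array and reports gibibytes per second; the array size, the timing method, and the single pass are illustrative choices, not a calibrated benchmark such as STREAM.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Minimal sketch: estimate sustained read bandwidth by streaming
   through a large array once. Sizes and timing method are illustrative. */
int main(void) {
    const size_t n = 1 << 26;                    /* 64 Mi doubles = 512 MiB */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < n; i++) a[i] = 1.0;   /* touch pages before timing */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += a[i];  /* sequential streaming reads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    double gib  = n * sizeof *a / (1024.0 * 1024.0 * 1024.0);
    printf("sum = %g, approx read bandwidth: %.2f GiB/s\n", sum, gib / secs);
    free(a);
    return 0;
}
```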

The Memory Wall and Its Impact on HPC

CPU performance has historically grown faster than memory performance. As a result, in many applications the CPU spends a large fraction of time waiting for data from memory. This mismatch is sometimes called the memory wall.

In an ideal situation, a CPU core can perform one or more operations on data each cycle. If the data is not already nearby, however, a memory access to RAM might require on the order of one hundred cycles or more. During this time the core may have very little useful work to perform, so the effective performance drops.
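
As a purely illustrative calculation, with a hypothetical 3 GHz clock and a 100 ns round trip to main memory (both numbers are assumptions, not measurements), a single uncached load costs

$$
100\ \text{ns} \times 3\ \text{GHz} = 300\ \text{cycles},
$$

so a core that could otherwise complete, say, 16 floating point operations per cycle forgoes on the order of $300 \times 16 \approx 5000$ operations while it waits for that one load.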

Caches and other techniques try to hide this latency, but for many scientific and engineering workloads, access to main memory remains the limiting factor. In HPC, a common outcome is that you cannot fully use the peak floating point capability of a node, because data does not arrive fast enough.

To analyze this situation more formally, people often compare two rates: the peak floating point operations per second of a processor and its peak memory bandwidth. Their ratio is often called the machine balance; the analogous quantity for an application, the number of floating point operations it performs per byte of data moved, is its arithmetic intensity. Many applications fall on the memory bound side of this comparison, where performance is determined by how fast data can be delivered.

For many HPC applications, performance is memory bound, not compute bound.
This means that optimizing data movement and locality can produce larger speedups than optimizing arithmetic operations alone.
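
As an illustrative worked example (the loop and the quoted machine balance are assumptions chosen for the sake of the calculation): consider the update a[i] = b[i] + s * c[i] on double precision arrays. Each iteration performs 2 floating point operations but moves 24 bytes, two 8 byte loads and one 8 byte store, so its arithmetic intensity is

$$
I = \frac{2\ \text{flops}}{24\ \text{bytes}} \approx 0.08\ \text{flops per byte},
$$

which is far below the machine balance of typical nodes, often on the order of 10 flops per byte or more, so a loop like this is memory bound on essentially any current hardware.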

Data Locality: Temporal and Spatial

The memory hierarchy works well only if programs exhibit data locality. Data locality describes how a program uses data over time and space in memory.

Temporal locality means that if a program uses a data item at some point, it is likely to use the same data item again soon. If a piece of data remains in a cache between uses, the second and later accesses will be much faster.

Spatial locality means that if a program uses a data item at some address, it is likely to use nearby data soon. Caches exploit this by fetching not just a single word, but an entire block or cache line at once. If a program then accesses adjacent array elements, these accesses can be served from the cache line already fetched.

The memory system is designed to work best when both forms of locality are present. Many numeric HPC algorithms are structured around arrays and loops precisely because these structures can provide good spatial and temporal locality when written carefully.

When locality is poor, for example when data accesses jump around in memory in an irregular or random pattern, caches are much less effective and most memory accesses fall back to slower levels.
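
A small C sketch of the difference, assuming a row-major matrix of an arbitrary illustrative size: the first loop nest walks memory contiguously and uses each fetched cache line fully, while the second strides across rows and typically runs several times slower.

```c
#include <stddef.h>

#define N 4096

/* Row-major traversal: the inner loop touches consecutive addresses,
   so each fetched cache line is fully used (good spatial locality). */
double sum_row_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal of the same array: the inner loop jumps
   N * sizeof(double) bytes per access, so most of each fetched cache
   line is wasted and the loop is typically much slower. */
double sum_col_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```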

Multilevel Caches and Inclusive Structure

To mitigate the memory wall, processors include multiple cache levels between registers and main memory. The closest level, L1 cache, is very small but has the lowest latency. Deeper levels, such as L2 and L3 caches, are larger but slower.

Data moves through the hierarchy in blocks or lines. When a load from main memory is requested, a cache line is allocated and filled. If that line is later requested by another load, and has not been evicted, it can be read from a cache level instead of main memory.

Many systems use an inclusive cache structure, where the contents of L1 are also present in L2, and the contents of L2 are present in L3. This is convenient for keeping coherence, which is the property that all cores see consistent values for shared data, and for quickly invalidating lines that have been modified elsewhere. Some architectures use other policies, but the basic effect is the same: frequently accessed data tends to be kept closer to the core.

For programmers, the detailed coherence protocols are usually hidden, but the multilevel nature of caches is important. Some data fits in L1 cache, some only in L2 or L3, and some must come repeatedly from RAM. The portion of a data structure that fits in the upper levels can be processed much more quickly.

Working Sets and Cache Behavior

The working set of a portion of code is the collection of data that is actively used during a certain time interval. If the working set fits in a particular cache level, the code can run mostly at that cache’s speed. If the working set is larger than a cache level, some data will be evicted before it is reused. Then, when it is needed again, it must be fetched from a lower level.

This effect creates abrupt changes in performance as problem sizes grow. For small problem sizes, the entire working set might fit in L1 cache. As the size grows, it may still fit in L2, then in L3, and finally exceed the last level cache and spill over into main memory.

In HPC, you often see performance plots where the runtime per operation is roughly constant up to a certain problem size, then increases in steps as the working set size crosses these cache thresholds. Understanding which parts of your algorithm form the working set, and at which scales, helps you reason about observed performance.
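
A minimal C sketch of such a measurement, with assumed sizes and a deliberately simple timing scheme: it repeatedly sums arrays of growing size and prints nanoseconds per element, which tends to step upward as the working set outgrows each cache level.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sketch: time repeated passes over arrays of growing size.
   When an array no longer fits in a cache level, time per element rises. */
int main(void) {
    for (size_t kib = 16; kib <= 64 * 1024; kib *= 2) {   /* 16 KiB .. 64 MiB */
        size_t n = kib * 1024 / sizeof(double);
        double *a = malloc(n * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < n; i++) a[i] = 1.0;

        const int reps = 64;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        double sum = 0.0;
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++) sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%8zu KiB: %.2f ns/element (sum=%g)\n",
               kib, ns / (reps * (double)n), sum);
        free(a);
    }
    return 0;
}
```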

NUMA and Memory Access on Multisocket Systems

Many HPC nodes contain more than one physical CPU socket. Each socket has its own memory controllers and a local portion of the installed RAM. The combined memory across sockets presents a single logical address space to the operating system, but physically the memory is arranged into domains that are closer to one socket than another.

This architectural style is called Non-Uniform Memory Access, or NUMA. Accessing memory local to a socket has lower latency and often higher bandwidth than accessing memory that is attached to another socket. The difference can be large enough to matter for performance sensitive codes.

On such systems, the physical placement of data in memory relative to the cores that use it has a measurable impact. Operating systems and runtime libraries try to allocate memory near the cores that first touch it, but this is not always sufficient. In HPC, it becomes important to understand that not all main memory accesses are equal, even within a node.
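
As a sketch of how placement can be influenced from user code, the first touch convention can be exploited by initializing data with the same thread layout that later computes on it; this assumes an OpenMP build and a Linux style first touch page placement policy, and the function names are illustrative.

```c
#include <stddef.h>
#include <omp.h>

/* Sketch: under a first-touch policy, the page containing a[i] is placed
   in the NUMA domain of the thread that first writes it. Initializing with
   the same parallel loop structure as the later compute loop keeps most
   accesses local to each thread's socket. */
void numa_friendly_init(double *a, double *b, size_t n) {
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
    }
}

void compute(double *a, const double *b, size_t n) {
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] += 2.0 * b[i];   /* same static distribution: mostly local pages */
}
```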

While the details of NUMA policies and control mechanisms are covered later, the key idea here is that the memory hierarchy extends not only vertically, from registers to RAM, but also horizontally across different memory domains with different access speeds.

The Cost Model for Memory Operations

To reason about performance, it is useful to attach rough costs to different categories of memory operations. The exact numbers depend heavily on the particular hardware, but the ratios between them tend to be similar across architectures.

You can think of memory access time as a step function with several plateaus. A load satisfied from a register is essentially free, because the value is already in the CPU’s immediate working space. A load from L1 cache might cost on the order of a handful of cycles. A load from L2 might cost several times more, and a load from L3 still more. A load from main memory can cost tens to hundreds of cycles.

Very roughly, an abstract latency model might look like:

$$
T_{\text{reg}} \ll T_{\text{L1}} \ll T_{\text{L2}} \ll T_{\text{L3}} \ll T_{\text{RAM}}
$$

where $T_{\text{reg}}$ is the time for register access and $T_{\text{RAM}}$ is the time for a round trip to main memory. The exact values are not important for this chapter. What matters is the large gap between cache hits and RAM accesses.

From the perspective of a code section, the effective access time becomes a weighted average of these levels. If a high fraction of operations hit in L1 and L2, the average is low. If many operations miss the caches and go to RAM, the average is high.
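
A common simplified way to write this weighted average is an average memory access time, with a single cache level standing in for the whole cache hierarchy:

$$
T_{\text{avg}} = h \, T_{\text{cache}} + (1 - h) \, T_{\text{RAM}},
$$

where $h$ is the fraction of accesses that hit in the cache. Because $T_{\text{RAM}}$ is so much larger than $T_{\text{cache}}$, even a few percent of misses can dominate $T_{\text{avg}}$.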

In many analyses of algorithm performance on HPC nodes, the dominant term is proportional to the number of cache misses that must reach main memory. Reducing this number often leads to large speedups.

A simple mental cost model is:

  • Register access: almost free
  • Cache hit: cheap
  • Cache miss to RAM: very expensive

Performance optimization often reduces to turning RAM accesses into cache hits, and keeping data in registers and caches as long as possible.

Memory Hierarchy and Parallel Execution on a Node

On a modern node, many cores share parts of the memory hierarchy. L1 and often L2 caches are private to each core, while L3 cache and main memory are shared among cores on a socket. This shared structure interacts with parallel programming models that use threads.

When multiple threads operate on data that maps to the same cache lines, contention effects and coherence traffic increase. This can reduce the effective bandwidth each thread sees and may increase latency. In extreme cases, poor data layout and access patterns cause cache thrashing, where useful data is rapidly evicted and reloaded.
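
A well known instance of this effect is false sharing, where threads update distinct variables that happen to occupy the same cache line; the C sketch below contrasts a problematic layout with a padded one (the 64 byte line size and the thread count are assumptions).

```c
#include <omp.h>

#define NTHREADS 8

/* Problematic layout: adjacent counters share cache lines, so every
   increment by one thread invalidates the line in other cores' caches. */
long counters_shared[NTHREADS];   /* shown only for contrast, not used below */

/* Padded layout: each counter occupies its own (assumed 64-byte) line,
   eliminating the coherence traffic caused by false sharing. */
struct padded { long value; char pad[64 - sizeof(long)]; };
struct padded counters_padded[NTHREADS];

void count_events(long iterations) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < iterations; i++)
            counters_padded[t].value++;   /* private cache line per thread */
    }
}
```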

Furthermore, bandwidth to main memory is finite. If many cores demand data from RAM simultaneously, they compete for this resource. The total bandwidth of the memory system may be reached before the computational capabilities of all cores are saturated.

As you learn about shared memory programming and vectorization, remember that every thread and vector lane ultimately depends on the same physical hierarchy. Scalable performance inside a node requires patterns of memory use that are friendly both to caches and to parallel access.

Implications for Algorithm Design in HPC

The existence of a memory hierarchy shapes the structure of high performance algorithms. Instead of treating memory as a uniform pool, algorithm designers explicitly organize computations to respect the hierarchy.

One common pattern is blocking or tiling. A large data structure, such as a matrix, is divided into smaller blocks that fit into a particular cache level. Computations are arranged to operate on one block at a time, reusing data as much as possible before moving on. This improves temporal locality and reduces the number of times data is loaded from lower levels.
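
A minimal C sketch of this idea applied to matrix-matrix multiplication; the matrix and block sizes are arbitrary illustrative choices that would in practice be tuned to a specific cache level, and the matrices are assumed to be stored row-major.

```c
#include <stddef.h>

#define N  1024
#define BS 64   /* block size: illustrative, tuned to a cache level in practice */

/* Blocked (tiled) matrix multiply C += A * B. Each trio of BS x BS blocks is
   small enough to stay resident in cache while it is reused, improving
   temporal locality compared with a naive triple loop over full matrices. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t jj = 0; jj < N; jj += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        double aik = A[i][k];
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i][j] += aik * B[k][j];   /* contiguous inner loop */
                    }
}
```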

Another pattern is reordering loops or data layouts so that innermost loops access memory contiguously. This improves spatial locality and helps the hardware prefetcher anticipate upcoming accesses. Prefetchers try to load data into caches before it is needed, but they are effective only for predictable patterns.

In HPC, the same mathematical algorithm can have very different performance depending on how it is organized around the memory hierarchy. Two implementations with identical arithmetic operations but different data access orders may differ in runtime by an order of magnitude or more.

Summary of the Memory Hierarchy’s Role in HPC

The memory hierarchy is a layered structure of storage technologies that trade off speed, size, and cost. It exists because a single uniform memory with ideal properties is not practical. For HPC programmers, the hierarchy is not just a hardware detail. It is a central part of how performance is determined.

Registers and caches feed the CPU very quickly, but they are small. Main memory is much larger but significantly slower to access. Multilevel caches, NUMA domains, and shared memory bandwidth complicate the picture further. Programs achieve high performance by maximizing locality, fitting their working sets into faster levels when possible, and arranging computations to minimize expensive transfers from lower levels.

In later chapters, you will examine registers, caches, and main memory in more detail, and see concrete examples of how to design codes that cooperate with the memory hierarchy instead of fighting it.
