
Cache

Position of cache in the memory hierarchy

Cache sits between the CPU cores and main memory. It is built from small, fast static RAM and holds copies of data and instructions that the CPU is likely to use again soon. From the point of view of a program that uses main memory, cache is almost invisible. The program still reads and writes addresses in memory, but the hardware automatically decides whether a particular access is served from a cache or must go to main memory.

Access to cache is much faster than access to RAM, and much more energy efficient. However, cache capacity is very limited compared to main memory, and cache behavior is automatic and driven by hardware policies. For HPC, this automatic behavior is a central performance factor, because it rewards certain access patterns and punishes others.

When a core reads or writes an address, the hardware checks whether the requested bytes are already present in its cache. If they are, the CPU can proceed at or near full speed. If not, it must wait for data to be brought from the next level in the hierarchy, which might be another cache level or main memory. This difference in latency between a cache hit and a cache miss is one of the key reasons that memory access patterns matter so much in high performance code.

A cache hit occurs when requested data is found in the cache.
A cache miss occurs when requested data is not in the cache and must be fetched from a lower, slower level of memory.
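As a rough illustration, the C sketch below times the same number of loads against a buffer that fits in cache and against one that does not. The buffer sizes, the line-sized stride, and the compile command are assumptions chosen for illustration; the exact timings and the size of the gap will differ from machine to machine.

```c
/* Rough feel for cache hits vs. misses (sizes are illustrative assumptions).
 * Both sweeps perform the same number of loads, one cache line apart, but the
 * small buffer stays resident in cache while the large one cannot.
 * Compile with, for example: gcc -O2 hit_vs_miss.c -o hit_vs_miss */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE 8   /* 8 doubles = 64 bytes, i.e. one new cache line per load */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

static double sweep(const double *a, size_t n, int repeats) {
    double s = 0.0;
    for (int r = 0; r < repeats; r++)
        for (size_t i = 0; i < n; i += STRIDE)
            s += a[i];
    return s;
}

int main(void) {
    size_t small_n = 4096;              /* 32 KiB of doubles: stays hot in L1/L2    */
    size_t large_n = 64 * 1024 * 1024;  /* 512 MiB of doubles: no cache can hold it */

    double *small = calloc(small_n, sizeof *small);
    double *large = calloc(large_n, sizeof *large);
    if (!small || !large) return 1;

    /* Both calls issue the same total number of loads (2^26). */
    double t0 = seconds();
    volatile double s1 = sweep(small, small_n, 131072); /* 512 lines reused over and over */
    double t1 = seconds();
    volatile double s2 = sweep(large, large_n, 8);      /* millions of lines, mostly misses */
    double t2 = seconds();

    printf("mostly hits:   %.2f s\nmostly misses: %.2f s\n", t1 - t0, t2 - t1);
    (void)s1; (void)s2;
    free(small); free(large);
    return 0;
}
```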

Multi level cache structure

Modern CPUs use multiple levels of cache. On a typical HPC node, each core has a very small private L1 cache, a larger but slower L2 cache, and shares a still larger L3 cache with other cores on the same chip. Exact sizes and organization vary between architectures, but the basic pattern is similar.

The L1 cache is closest to the execution units of the core and usually split into separate instruction and data caches. It has very low latency and limited capacity. The L2 cache is still private to a core on most processors but larger. Its latency is higher than L1, yet much lower than main memory. The L3 cache, often called the last level cache, is typically shared by all cores on a socket. It has more capacity but higher latency again.

This hierarchy allows the hardware to exploit locality at several scales. Very recently used values are likely to be in L1. Values that were used a little earlier may still reside in L2. Data that other cores have accessed or that belongs to a longer working set may be found in L3. If none of these hold the data, the request must go to main memory. In HPC, the working sets of kernels are often sized and organized with these levels in mind, for example, by blocking matrix operations so that inner loops operate entirely within L1 or L2.

Cache lines and spatial locality

A cache does not move data as arbitrary individual bytes. Instead it operates on fixed size chunks called cache lines. A cache line is a contiguous block of memory of a fixed number of bytes, for example 64 bytes on many contemporary CPUs. When a core needs a particular address that is not present, the cache controller fetches the entire line that contains that address and stores it in the appropriate cache level.

This behavior supports what is known as spatial locality. If a program accesses one element of an array, it often soon accesses neighboring elements. By transferring a full cache line, the CPU can serve later accesses to nearby data from cache without additional main memory traffic.

For example, if a double precision element uses 8 bytes and the cache line is 64 bytes, then 8 consecutive elements fit in one cache line. Accessing an array with a unit stride, that is, visiting consecutive elements, makes good use of each fetched line. In contrast, if a loop visits every 8th element, each access may touch a different cache line and the cache will be used less effectively.

Cache transfers whole cache lines, not individual words.
Access patterns with contiguous memory accesses exploit spatial locality and make better use of each cache line.
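The sketch below contrasts the two patterns in C. The function names are invented for illustration, but the loops correspond directly to the unit stride and stride 8 cases described above.

```c
/* With 8-byte doubles and 64-byte lines, unit stride touches each fetched
 * cache line 8 times, while a stride of 8 touches every line only once. */
#include <stddef.h>

double sum_unit_stride(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)      /* consecutive elements: one new line per 8 loads */
        s += a[i];
    return s;
}

double sum_stride_8(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += 8)   /* every 8th element: roughly one new line per load */
        s += a[i];
    return s;
}
```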

Temporal locality and cache reuse

Temporal locality refers to the reuse of specific data within a short time window. Cache is very effective when data is accessed repeatedly in quick succession, because the data can remain in the cache long enough to serve multiple uses.

In HPC programs, many algorithms are designed to increase temporal locality. For example, in a matrix multiply, a naive implementation reuses elements of each matrix many times, but by the time an element is needed again it has often already been evicted, leading to frequent cache misses. A blocked implementation subdivides the matrices into tiles that fit in L1 or L2. Each tile is then reused in several operations while it stays hot in cache, which reduces the number of times data must be loaded from main memory.

If the working set of an inner loop is larger than the capacity of a particular cache level, the data may be evicted before it is reused. In that case, temporal locality for that level is poor, and performance is limited by lower levels or by main memory.
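A minimal C sketch of the blocking idea is shown below. The tile size is an assumption that would be tuned to the L1 or L2 capacity of the target processor; the code shows the loop structure rather than a fully optimized kernel.

```c
/* Naive vs. blocked matrix multiply (C = C + A * B), row-major n x n matrices.
 * BLOCK is an assumed tile size; in practice it is chosen so that the tiles
 * of A, B, and C being worked on fit together in L1 or L2. */
#include <stddef.h>

#define BLOCK 64

void matmul_naive(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

void matmul_blocked(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                /* Reuse one tile of A, B, and C while it stays hot in cache. */
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                        double aik = A[i * n + k];
                        for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```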

Cache mapping and associativity

Each cache must decide where in its internal storage to place a particular line from memory. To do this efficiently in hardware, caches use fixed mapping schemes. A fully associative cache could place a line anywhere, but that is too expensive to build at large sizes. Practical caches are therefore direct mapped or, more commonly, set associative, which sits between the two extremes.

In a direct mapped cache, each block of memory can reside in exactly one location in the cache. This keeps lookup simple, but if a program alternates between addresses that map to the same line, it will constantly evict and reload data, even if the cache has plenty of storage overall. This situation is called a conflict miss.

Most CPU caches today are set associative. The cache is divided into sets, and each memory block that maps to a set can be stored in any of the slots, called ways, within that set. A 4 way set associative cache, for instance, holds up to 4 cache lines in each set. This reduces conflict misses compared to a direct mapped design while still allowing fast lookup.

Although most HPC programmers do not manage mapping directly, some access patterns interact poorly with associativity and produce more conflict misses. Large arrays whose sizes align badly with cache sizes and line mappings can cause repeated eviction of active data. In practice, slightly padding arrays or rearranging loop orders can sometimes reduce such conflicts.
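As a rough sketch, the following C fragment shows how a set associative cache might derive the set index from an address, assuming 64 byte lines and 64 sets. The exact geometry varies between processors, but the calculation makes clear why addresses separated by certain power of two strides compete for the same set.

```c
/* How a set associative cache picks a set for an address (assumed geometry:
 * 64-byte lines, 64 sets). Addresses that differ by a multiple of
 * 64 * 64 = 4096 bytes land in the same set and compete for its ways,
 * which is why power-of-two strides can trigger conflict misses. */
#include <stdint.h>

#define LINE_SIZE 64u   /* bytes per cache line (assumed)        */
#define NUM_SETS  64u   /* number of sets in the cache (assumed) */

unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}
```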

Types of cache misses

From a performance point of view, not all cache misses are equal. They are commonly grouped into three categories. Although hardware does not label misses this way, the classification helps to reason about performance.

Compulsory misses occur the first time a line is accessed. If your program has never touched a particular address, the line cannot already be in the cache. Such misses are inherent, but their cost can be reduced by careful ordering or by using prefetch mechanisms that overlap latency with computation.

Capacity misses occur when the cache cannot hold all the data that the program needs over a certain time period. Even with perfect replacement policies, if the working set is larger than the cache, lines will be evicted and later needed again. Blocking and tiling techniques in HPC codes often aim to reduce capacity misses by shrinking the active working set to fit in L1, L2, or L3.

Conflict misses occur when different addresses compete for the same set in a set associative cache, or the same line in a direct mapped cache, even though the overall cache has room. Certain regular strides through memory can trigger conflict misses. Adjusting array dimensions or offsets to avoid pathological alignments is a common trick in performance tuning.

Cache misses are often grouped into compulsory, capacity, and conflict misses.
Reducing capacity and conflict misses is a major target of cache conscious optimization.
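One common workaround for conflict misses is to pad the leading dimension of an array so that column walks no longer advance by an exact power of two. The sketch below uses illustrative sizes and an arbitrary padding amount; suitable values depend on the cache geometry of the target machine.

```c
/* Padding the leading dimension of a 2D array to reduce conflict misses.
 * With 1024 doubles per row (8192 bytes), walking down a column visits
 * addresses exactly 8192 bytes apart, which can map to the same few cache
 * sets over and over. A few unused elements per row break this pattern. */
#include <stddef.h>

#define N   1024
#define PAD 8                          /* illustrative padding, in elements */

double a_tight[N][N];                  /* rows are exactly 8192 bytes apart */
double a_padded[N][N + PAD];           /* rows are 8256 bytes apart         */

double column_sum_tight(size_t col) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += a_tight[i][col];          /* power-of-two stride: conflict prone */
    return s;
}

double column_sum_padded(size_t col) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += a_padded[i][col];         /* stride is no longer a power of two */
    return s;
}
```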

Cache replacement policies

When a new line must be loaded and the relevant set in the cache is already full, the cache hardware must decide which existing line to evict. This choice is governed by a replacement policy. Common policies include Least Recently Used, First In First Out, and random replacement. In real hardware, the exact policies are often approximations that are cheap to implement.

Least Recently Used tries to remove the line that has not been accessed for the longest time, on the assumption that recently used lines are likely to be used again. This matches programs with strong temporal locality. However, true LRU is expensive to track at every access in a large cache, so approximations are typical.

From an HPC standpoint, you cannot directly choose the replacement policy, but you can write code that works with it. Streaming through huge arrays in a single pass, with little reuse, will quickly evict previously loaded data because the hardware must make space. Loops that reuse data in a structured way exploit the built in preference for recently used lines.

Inclusive, exclusive, and shared caches

Multi level caches may have different policies for how lines are distributed across levels. In an inclusive hierarchy, any line that is present in a higher level cache, such as L1, is also stored in the lower level, such as L2 or L3. This simplifies some aspects of coherence but reduces effective total capacity, because parts of L2 or L3 are essentially duplicates of what already resides in L1.

In an exclusive design, a line is present in only one level at a time. If it moves from L2 into L1, it is removed from L2, so the total capacity across levels adds up more fully. There are also non inclusive designs that mix properties of both.

From a programmer’s point of view, what matters is the effective data capacity that can be active at once for a core or for a socket. HPC documentation for specific architectures often mentions whether caches are inclusive or exclusive. When tuning bandwidth intensive kernels, you may see performance differences that reflect how data moves between levels and how much duplication occurs.

In multi core processors, L3 is often shared by all cores on a socket. This shared last level cache acts as a buffer between cores and memory. It can reduce bandwidth demands on main memory when different cores reuse similar data, such as shared read only tables. However, it also creates contention. One core’s heavy use of L3 can evict lines that others might need.

Cache coherence in multi core systems

In multi core systems, each core has its own private caches at least at L1 and often L2. When multiple cores read and write shared data, the hardware must keep their views consistent. This property is called cache coherence. Coherence protocols ensure that when one core writes to a memory location, other cores do not continue to read stale values from their caches.

The details of coherence protocols are complex, but the practical effect is that writes to shared data can cause extra cache traffic. When a core writes to a line that is also cached by other cores, it may trigger invalidation messages or line transfers. These coherence transactions consume bandwidth and can stall cores if they need to wait for ownership of a line.

In HPC codes that use shared memory programming, certain patterns aggravate coherence overhead. For example, if many threads repeatedly update adjacent elements that reside in the same cache line, they can contend for ownership of that line. This phenomenon is often called false sharing, because the threads might logically work on different variables, but the cache treats the line as a unit.

Avoiding unnecessary sharing of writable cache lines between cores is an important practical guideline. Techniques include padding data structures so that frequently written variables sit in distinct lines, aligning arrays appropriately, and grouping thread local data so that each thread modifies mostly private lines.
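The sketch below illustrates the padding idea with OpenMP. The structure layout, the assumed 64 byte line size, and the thread count limit are illustrative choices, not prescriptions.

```c
/* False sharing sketch with OpenMP: per-thread counters packed into one cache
 * line force cores to fight over that line; padding each counter to its own
 * 64-byte line (assumed line size) removes the contention.
 * Assumes the number of OpenMP threads does not exceed MAX_THREADS. */
#include <omp.h>

#define MAX_THREADS 64
#define LINE_SIZE   64

/* Problematic layout: adjacent counters share a cache line. */
long counters_packed[MAX_THREADS];

/* Padded layout: each counter occupies its own line. */
struct padded_counter {
    long value;
    char pad[LINE_SIZE - sizeof(long)];
};
struct padded_counter counters_padded[MAX_THREADS];

void count_events(long n) {
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < n; i++)
            counters_padded[t].value++;   /* private line: no false sharing      */
            /* counters_packed[t]++ here would repeatedly invalidate a shared line */
    }
}
```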

Cache and vectorization

Vectorization, which is treated in more detail elsewhere, refers to the use of SIMD instructions that operate on multiple data elements within a single instruction. Cache behavior and vectorization are closely related. Effective vectorization requires that the CPU can feed the vector units with enough data, which in turn requires that data arrives from cache quickly and is laid out contiguously.

When data is stored in memory in a structure of arrays layout, with each field in a separate contiguous array, vectorized loops can load entire cache lines filled with useful elements for a given operation. In contrast, array of structures layouts can interleave fields that are not all used together, which may cause each cache line to contain only a subset of useful data for a particular pass.
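The following C sketch contrasts the two layouts for a simple particle update. The structure names and fields are invented for illustration.

```c
/* Array-of-structures vs. structure-of-arrays for a particle position update. */
#include <stddef.h>

/* AoS: each cache line holds a mix of fields. A loop that never touches mass
 * still drags it through the cache, because mass shares lines with the fields
 * that are used. */
struct particle {
    double x, y, z;
    double vx, vy, vz;
    double mass;
};

void push_aos(struct particle *p, size_t n, double dt) {
    for (size_t i = 0; i < n; i++) {
        p[i].x += dt * p[i].vx;
        p[i].y += dt * p[i].vy;
        p[i].z += dt * p[i].vz;
    }
}

/* SoA: each field is its own contiguous array, so every fetched line is full
 * of values this loop actually uses, which also suits SIMD loads. */
struct particles {
    double *x, *y, *z;
    double *vx, *vy, *vz;
    double *mass;
};

void push_soa(struct particles *p, size_t n, double dt) {
    for (size_t i = 0; i < n; i++) {
        p->x[i] += dt * p->vx[i];
        p->y[i] += dt * p->vy[i];
        p->z[i] += dt * p->vz[i];
    }
}
```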

Cache lines are also the unit over which streaming and prefetching are designed. Many compilers and processors automatically issue prefetches for sequential access patterns. These prefetches aim to bring lines into cache before the vector instruction needs them, so that latency is hidden.
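For access patterns that hardware prefetchers cannot predict, such as indirect addressing, some codes issue explicit prefetches. The sketch below uses the GCC and Clang builtin __builtin_prefetch with an assumed prefetch distance; whether this helps at all depends on the processor and has to be measured.

```c
/* Explicit software prefetch (GCC/Clang builtin) for an indirect access
 * pattern the hardware prefetcher cannot predict. The prefetch distance
 * AHEAD is an assumption and would need tuning per machine. */
#include <stddef.h>

#define AHEAD 16

double gather_sum(const double *a, const size_t *idx, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&a[idx[i + AHEAD]], 0, 1);  /* read, low temporal locality */
        s += a[idx[i]];
    }
    return s;
}
```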

For HPC, loops that combine stride 1 access, enough arithmetic per loaded element, and data sets that fit within the effective cache capacities tend to achieve the highest performance. When loops use irregular patterns, rely heavily on indirect addressing, or mix multiple large arrays with poor alignment, the cache may become the primary bottleneck rather than the floating point units.

Simple cache performance model

For reasoning at a high level, it can be useful to think of cache behavior in terms of average memory access time. A simple model considers the latency of each level and the probabilities of hitting or missing at each level.

If $T_{L1}$ is the access time when data is found in L1, $T_{L2}$ is the additional time to access L2 on an L1 miss, and $T_{mem}$ is the further time to access memory on an L2 miss, and if $h_1$ is the hit rate in L1 and $h_2$ is the hit rate in L2 given an L1 miss, then a rough estimate of average access time $T_{avg}$ is

$$
T_{avg} \approx T_{L1}
+ (1 - h_1) \, T_{L2}
+ (1 - h_1) (1 - h_2) \, T_{mem}.
$$
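For example, with illustrative values of $T_{L1} = 4$ cycles, $T_{L2} = 10$ cycles, $T_{mem} = 200$ cycles, $h_1 = 0.9$, and $h_2 = 0.8$, the estimate becomes

$$
T_{avg} \approx 4 + 0.1 \times 10 + 0.1 \times 0.2 \times 200 = 4 + 1 + 4 = 9 \text{ cycles}.
$$

Only 2 percent of accesses reach main memory in this example, yet they account for almost half of the average access time.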

Average memory access time is lowered by increasing cache hit rates and reducing misses at higher, more expensive levels.

While actual processors overlap many operations and use out of order execution to hide latency, this model is still useful for intuition. If a change to an algorithm improves L1 and L2 hit rates, then even if individual operations are unchanged, overall performance can increase substantially because fewer accesses pay the full main memory cost.

Practical cache aware thinking for HPC beginners

At an introductory level, the goal is not to memorize microarchitectural details but to develop a mental habit of considering cache whenever you think about performance.

When you look at a loop, ask whether accesses are contiguous in memory or jumping around. Sequential traversals of arrays are more cache friendly than random accesses. When an algorithm uses data that conceptually fits into a small region, consider whether that region can be structured to fit into a particular cache level. Blocking and tiling strategies often come from this line of thinking.

When parallelizing on many cores, be aware that each core has its own private caches and that they share a last level cache and main memory. If you give each thread its own slice of data that fits in its private caches, you reduce interference. If every thread repeatedly touches the same shared data, cache coherence and contention can dominate.

Caches are automatic, but their rules are regular enough that programmers can take advantage of them. Over time, inspecting performance counters, reading architecture guides, and experimenting with small kernels will sharpen your intuition about how caches behave on the specific HPC systems you use.
