What Cache Is in the Memory Hierarchy
Cache is a small, very fast type of memory located close to the CPU cores. Its purpose is to keep copies of the data and instructions that the CPU is likely to use soon, so they can be accessed much more quickly than from main memory (RAM).
In the memory hierarchy, cache sits between CPU registers and main memory, trading capacity for speed:
- Registers: tiny, fastest, directly in the core
- Cache: small, very fast, on or near the CPU
- Main memory (RAM): large, much slower than cache
In HPC, understanding cache behavior is crucial because many performance problems are really “cache problems”: the CPU spends time waiting for data from memory rather than doing computations.
Cache Levels: L1, L2, L3
Modern CPUs usually have multiple levels of cache, each with different size and speed:
- L1 cache
- Smallest (e.g., 32–64 KB per core for data, similar for instructions)
- Fastest (few CPU cycles of latency)
- Often split into:
- L1D (data cache)
- L1I (instruction cache)
- L2 cache
- Larger (e.g., 256 KB–1 MB per core)
- Slower than L1, faster than L3/RAM
- Usually private to each core
- L3 cache (last-level cache)
- Much larger (e.g., several MB to tens of MB shared among cores on a socket)
- Slower than L2 but still much faster than RAM
- Often shared across all cores in a CPU package
Some architectures add an L4 cache or on-package high-bandwidth memory, but the basic idea is always the same: small and fast near the cores, larger and slower farther away.
Why Cache Matters for Performance
Access time grows dramatically as you move down the hierarchy:
- L1: ~1–4 cycles
- L2: ~5–15 cycles
- L3: ~20–50 cycles
- RAM: ~100+ cycles (can be much more)
- Storage: orders of magnitude slower again
If data is found in cache (a cache hit), the CPU keeps running quickly.
If data is not in cache (a cache miss), the CPU must wait while the data is brought from a lower level (L2, L3, or RAM).
For many HPC workloads, performance is limited not by floating-point operations but by how efficiently data is moved through caches. Two codes with identical arithmetic can differ hugely in speed because one is cache-friendly and the other is not.
Cache Lines and Spatial Locality
Caches move data between levels in fixed-size chunks called cache lines (or cache blocks). A typical cache line size is 64 bytes, though it can vary by architecture.
When the CPU needs a particular memory address:
- The cache checks if the address is in one of its lines.
- If not, the cache loads the entire line containing that address from the next lower level.
This design exploits spatial locality: the expectation that if you access one location in memory, you’re likely to soon access nearby locations.
For example, for a double array (8 bytes per element) with 64-byte lines, a single cache line holds 8 consecutive elements. If your code iterates through the array sequentially, each cache line brings in useful data for multiple future accesses.
Implication for HPC code:
Algorithms that access memory in a contiguous, predictable order allow caches to be used efficiently, because each fetched cache line contains many useful values.
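As a minimal sketch in C (the array size is arbitrary), the two functions below read exactly the same elements of a row-major array. The first walks memory contiguously and uses every byte of each fetched line; the second walks down columns and, for large arrays, fetches a new line for almost every access:

    #include <stddef.h>

    #define N 1024
    static double a[N][N];   /* row-major: a[i][0], a[i][1], ... are adjacent */

    double row_major_sum(void) {
        double sum = 0.0;
        /* Contiguous access: one 64-byte line supplies 8 consecutive doubles. */
        for (size_t i = 0; i < N; ++i)
            for (size_t j = 0; j < N; ++j)
                sum += a[i][j];
        return sum;
    }

    double col_major_sum(void) {
        double sum = 0.0;
        /* Strided access: consecutive reads are N*8 bytes apart, so each
           access typically pulls in a new line and uses only 8 of its 64 bytes. */
        for (size_t j = 0; j < N; ++j)
            for (size_t i = 0; i < N; ++i)
                sum += a[i][j];
        return sum;
    }

On matrices that do not fit in cache, the column-order version is often several times slower, even though both functions do exactly the same arithmetic on exactly the same data.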
Temporal Locality and Reuse
Caches also exploit temporal locality: if you use a piece of data now, you are likely to use it again soon.
When data is loaded into cache, it tends to stay there for some time (until it is evicted). If the same data is accessed again before eviction, the access is served quickly from cache.
Good HPC codes increase data reuse:
- Reuse values in registers and caches as many times as possible before moving on.
- Structure loops so that data brought into the cache is used multiple times before being evicted.
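For example, in the matrix-vector product below (a sketch; a row-major layout for A is assumed), each matrix element is used exactly once, so it offers no temporal reuse, but every element of the vector x is accessed once per row, n times in total. If x fits in L1 or L2, all of those repeated accesses after the first are cache hits:

    #include <stddef.h>

    /* y = A * x, with A stored row-major as an n-by-n array.
       A[i*n + j] is touched exactly once (no reuse possible),
       but x[j] is touched once per row. Keeping x small enough to
       stay cache-resident makes those repeated accesses cheap. */
    void matvec(size_t n, const double *A, const double *x, double *y) {
        for (size_t i = 0; i < n; ++i) {
            double acc = 0.0;
            for (size_t j = 0; j < n; ++j)
                acc += A[i * n + j] * x[j];
            y[i] = acc;
        }
    }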
Cache Hits, Misses, and Penalties
Each memory access falls into one of these categories at each cache level:
- Hit: data is found in the cache
- Miss: data is not in the cache and must be fetched from a lower level
A miss at L1 may hit in L2, or miss again and be found in L3, and so on.
The miss penalty is the extra time required to fetch data from the next level. When misses are frequent and their penalties are large, your code becomes memory-bound instead of compute-bound.
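As a rough, illustrative calculation (the numbers are made up but plausible): if a hit costs 4 cycles, 5% of accesses miss all the way to RAM, and the miss penalty is 100 cycles, the average memory access time is about 4 + 0.05 × 100 = 9 cycles, more than double the hit time. Halving the miss rate to 2.5% brings the average down to about 6.5 cycles, which is why even modest locality improvements can noticeably speed up memory-bound code.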
Common miss types (conceptually):
- Cold (compulsory) misses: first-time accesses to data never seen before.
- Capacity misses: the working set is larger than the cache; data is evicted before it is reused.
- Conflict misses: even though there is enough total space, the mapping of addresses to cache lines causes evictions (related to cache associativity; see below).
Cache Mapping and Associativity (Conceptual)
A cache must decide where in the cache a given memory address can be stored. The main schemes:
- Direct-mapped cache
- Each memory address maps to exactly one cache line position.
- Simple and fast, but prone to conflict misses.
- Fully associative cache
- Any address can be stored in any cache line.
- Few conflict misses, but complex and slower lookups.
- Set-associative cache (common compromise)
- Cache is divided into sets; each memory address maps to one set.
- Within that set, the address can occupy any of a fixed number of “ways” (e.g., 8-way set-associative).
- Balances hardware complexity and miss rate.
You usually don’t control associativity directly in code, but access patterns can create or avoid conflict misses. Highly regular strides that repeatedly hit the same sets may lead to more conflicts.
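As a hedged illustration of how an access pattern can cause conflict misses, assume an L1 cache of 32 KB, 8-way set-associative, with 64-byte lines (a common but not universal geometry). That gives 64 sets, so addresses 4096 bytes apart map to the same set. Walking down a column of a matrix whose rows are exactly 4096 bytes wide then funnels every access into a single set:

    #include <stddef.h>

    /* Assumed (illustrative) L1 geometry: 32 KB, 8-way, 64-byte lines.
       That gives 64 sets, so addresses 64 * 64 = 4096 bytes apart share a set. */
    #define ROWS 64
    #define COLS 512   /* 512 doubles * 8 bytes = 4096 bytes per row */
    static double m[ROWS][COLS];

    /* One column touches 64 lines that are exactly 4096 bytes apart, so under
       the assumed geometry they all compete for the same set's 8 ways.
       Repeated traversals keep evicting and re-fetching lines even though the
       touched lines total only ~4 KB of a 32 KB cache. */
    double sum_column_repeatedly(size_t col, int repeats) {
        double sum = 0.0;
        for (int r = 0; r < repeats; ++r)
            for (size_t i = 0; i < ROWS; ++i)
                sum += m[i][col];
        return sum;
    }

Only 8 of those 64 lines can be resident in the set at once, so the traversal keeps missing despite using a small fraction of the cache; padding the row length to break the power-of-two stride is a common remedy.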
Inclusive vs Exclusive Caches
The relationship between the contents of different cache levels also matters:
- Inclusive cache hierarchy
- Data in L1 is also present in L2/L3.
- Makes some coherence operations simpler, but can waste capacity.
- Exclusive cache hierarchy
- Data is present in only one level at a time (e.g., in L1 or L2, but not both).
- Increases total effective capacity, but may require more complex movement.
In practice this varies by vendor and CPU generation: for example, many Intel CPUs have used inclusive (or, on recent server parts, non-inclusive) last-level caches, while AMD has often used exclusive victim-cache designs for L3.
As an HPC programmer, you generally don’t choose this, but it affects how large a “working set” you can expect to fit efficiently.
Cache Coherence (High-Level View)
On multicore CPUs, each core has its own private caches (L1 and often L2). If two cores work on shared data, their caches must stay logically consistent. This is maintained by a cache-coherence protocol (e.g., MESI and variants).
Key practical implications for HPC:
- False sharing: if two threads update different variables that happen to be on the same cache line, they can cause unnecessary coherence traffic and slowdowns, even though they don’t logically share data.
- Shared read-only data is cheap to replicate in caches.
- Frequent writes to shared data can create contention among cores.
Detailed thread and synchronization behavior is covered in shared-memory programming topics; here, it’s enough to recognize that cache coherence exists and can impact performance.
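False sharing in particular is easy to trigger by accident. The sketch below assumes OpenMP is available (compile with -fopenmp on GCC or Clang); the thread count, iteration count, and 64-byte padding are illustrative assumptions rather than fixed requirements:

    #include <omp.h>

    #define NTHREADS 4
    #define ITERS 10000000L

    /* Unpadded: the four counters sit next to each other, so they share
       cache lines. volatile keeps the increments from being folded away
       by the optimizer, so the line really does bounce between cores. */
    static volatile long counters[NTHREADS];

    /* Padded: each counter occupies its own (assumed) 64-byte line. */
    static struct { volatile long value; char pad[64 - sizeof(long)]; } padded[NTHREADS];

    void count_unpadded(void) {
        #pragma omp parallel num_threads(NTHREADS)
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < ITERS; ++i)
                counters[t]++;        /* logically private, physically shared line */
        }
    }

    void count_padded(void) {
        #pragma omp parallel num_threads(NTHREADS)
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < ITERS; ++i)
                padded[t].value++;    /* each thread owns its own line */
        }
    }

Each thread only ever touches its own counter, yet in the unpadded version every increment invalidates the line in the other cores' caches; the padded version typically runs much faster on multicore hardware.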
Typical Cache Sizes and Latencies (Order of Magnitude)
Exact numbers differ by CPU generation, but the pattern is stable:
- L1: tens of KB per core, single-digit cycles latency
- L2: hundreds of KB to ~1 MB per core, ~10 cycles
- L3: several MB to tens of MB shared, tens of cycles
- RAM: gigabytes, hundreds of cycles
For HPC, this means:
- Fitting tight inner loops’ working set into L1 or L2 can drastically boost performance.
- L3 often serves as a “shock absorber” between fast cores and slower RAM, but relying heavily on L3 is still slower than staying in L1/L2.
- Exceeding cache capacity with a large working set typically leads to a sharp performance drop.
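One way to observe these thresholds on a given machine is a small timing sweep over working-set sizes. The sketch below assumes a POSIX clock_gettime; the sizes and repeat count are arbitrary, and the exact points where the time per element jumps depend on the CPU's cache sizes:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Repeatedly sum a buffer of n doubles and report time per element.
       As n grows past each cache level's capacity, time per element rises. */
    static double time_per_element(size_t n, int repeats) {
        double *buf = malloc(n * sizeof *buf);
        if (!buf) return 0.0;
        for (size_t i = 0; i < n; ++i) buf[i] = 1.0;

        volatile double sink = 0.0;          /* keep the work from being optimized away */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < repeats; ++r) {
            double sum = 0.0;
            for (size_t i = 0; i < n; ++i) sum += buf[i];
            sink += sum;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        free(buf);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / ((double)n * repeats);   /* nanoseconds per element */
    }

    int main(void) {
        /* Sweep working sets from 4 KB to 64 MB worth of doubles. */
        for (size_t n = 512; n <= (1u << 23); n *= 2)
            printf("%10zu doubles (%7zu KB): %.3f ns/element\n",
                   n, n * sizeof(double) / 1024, time_per_element(n, 100));
        return 0;
    }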
Cache-Friendly vs Cache-Unfriendly Access Patterns
Simple, abstract examples:
- Cache-friendly:
- Iterating linearly over arrays:
  for (i = 0; i < N; ++i) a[i] = b[i] + c[i];
- Blocking/tiling data so that chunks fit in cache and are reused before moving on.
- Cache-unfriendly:
- Large, irregular pointer-chasing (e.g., following a linked list scattered across memory).
- Accesses with large strides that skip many elements, wasting most of each cache line.
- Random access patterns that defeat spatial and temporal locality.
In HPC, you often reorganize data structures and loops specifically to improve cache locality (e.g., changing array order, loop order, or using blocking).
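As a concrete sketch of blocking (the tile size BS is an assumption that would normally be tuned to the target cache), consider an out-of-place matrix transpose: the naive version streams through one array contiguously but strides through the other, while the blocked version works on small tiles whose source and destination both stay cache-resident:

    #include <stddef.h>

    #define BS 32   /* tile edge: 32*32 doubles = 8 KB per tile (illustrative) */

    /* Naive transpose: reads from in are contiguous, but writes to out jump
       n*8 bytes between consecutive inner iterations. */
    void transpose_naive(size_t n, const double *in, double *out) {
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                out[j * n + i] = in[i * n + j];
    }

    /* Blocked transpose: each BS x BS tile of in and the matching tile of out
       fit in cache together, so the strided accesses stay within a small,
       cache-resident footprint before the loops move on to the next tile. */
    void transpose_blocked(size_t n, const double *in, double *out) {
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                for (size_t i = ii; i < ii + BS && i < n; ++i)
                    for (size_t j = jj; j < jj + BS && j < n; ++j)
                        out[j * n + i] = in[i * n + j];
    }

Both functions perform the same element-by-element copies; only the order of the work changes, which is the essence of cache-oriented loop restructuring.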
Measuring and Observing Cache Behavior
Many performance tools provide cache-related metrics such as:
- Cache hit/miss rates at different levels
- Stall cycles due to memory
- Bandwidth used between caches and memory
While details of tools belong elsewhere, it’s important to know that:
- Poor cache behavior shows up as low instruction throughput (low IPC) and a large fraction of cycles stalled waiting for memory.
- Improvement strategies include changing data layout, loop structure, and access patterns to better exploit caches.
Summary: What to Remember About Cache for HPC
- Cache is a small, fast memory layer between CPU and RAM.
- Data moves in cache lines, enabling spatial locality.
- Reusing data soon after it is loaded exploits temporal locality.
- Multiple cache levels (L1, L2, L3) trade off size and speed.
- Poor cache usage leads to cache misses, long memory stalls, and memory-bound performance.
- HPC codes that are organized around locality and reuse can achieve much higher performance without changing the underlying algorithms.