
Cache

What Cache Is in the Memory Hierarchy

Cache is a small, very fast type of memory located close to the CPU cores. Its purpose is to keep copies of the data and instructions that the CPU is likely to use soon, so they can be accessed much more quickly than from main memory (RAM).

In the memory hierarchy, cache sits between CPU registers and main memory, trading capacity for speed:

registers → L1 cache → L2 cache → L3 cache → main memory (RAM)

Each step to the right is larger but slower than the one before it.

In HPC, understanding cache behavior is crucial because many performance problems are really “cache problems”: the CPU spends time waiting for data from memory rather than doing computations.

Cache Levels: L1, L2, L3

Modern CPUs usually have multiple levels of cache, each with a different size and speed:

  - L1: the smallest and fastest level, typically tens of KB per core, usually split into separate instruction and data caches.
  - L2: larger and somewhat slower, typically a few hundred KB to a few MB, usually private to each core.
  - L3 (last-level cache): the largest and slowest level, typically several MB to tens of MB, usually shared by all cores on the chip.

Some architectures add L4 cache or on-package high-bandwidth memory, but the basic idea is always the same:
small & fast near the cores, larger & slower farther away.

Why Cache Matters for Performance

Access time grows dramatically as you move down the hierarchy: each level is roughly an order of magnitude slower than the one above it, from a few cycles for L1 up to hundreds of cycles for main memory.

If data is found in cache (a cache hit), the CPU keeps running quickly.
If data is not in cache (a cache miss), the CPU must wait while the data is brought from a lower level (L2, L3, or RAM).

For many HPC workloads, performance is limited not by floating-point operations but by how efficiently data is moved through caches. Two codes with identical arithmetic can differ hugely in speed because one is cache-friendly and the other is not.

Cache Lines and Spatial Locality

Caches move data between levels in fixed-size chunks called cache lines (or cache blocks). A typical cache line size is 64 bytes, though it can vary by architecture.

When the CPU needs a particular memory address:

  1. The cache checks if the address is in one of its lines.
  2. If not, the cache loads the entire line containing that address from the next lower level.
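
As a small sketch of what “the entire line containing that address” means, these helpers (illustrative names, assuming 64-byte lines) show how addresses group into lines:

```c
#include <stdint.h>

#define LINE_SIZE 64ULL   /* assumed cache-line size; 64 bytes is typical but not universal */

/* All addresses with the same line_index() live in the same cache line,
   so a miss on any one of them fetches the whole 64-byte block. */
uint64_t line_index(uint64_t addr) { return addr / LINE_SIZE; }
uint64_t line_base(uint64_t addr)  { return addr & ~(LINE_SIZE - 1); }  /* first byte of that line */
```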

This design exploits spatial locality: the expectation that if you access one location in memory, you’re likely to soon access nearby locations.

For example, for a double array (8 bytes per element) with 64-byte lines, a single cache line holds 8 consecutive elements. If your code iterates through the array sequentially, each cache line brings in useful data for multiple future accesses.

Implication for HPC code:
Algorithms that access memory in a contiguous, predictable order allow caches to be used efficiently, because each fetched cache line contains many useful values.
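
A short sketch of this effect, assuming 64-byte lines and 8-byte doubles (the function name is illustrative):

```c
/* With stride 1, every fetched cache line supplies 8 useful elements.
   With stride 8, each element touched lives on a different line, so the loop
   fetches roughly 8x as many lines for the same number of additions. */
double strided_sum(const double *x, long n, long stride) {
    double s = 0.0;
    for (long i = 0; i < n; i += stride)
        s += x[i];
    return s;
}

/* strided_sum(x, n, 1): cache-line friendly, uses every element of each line  */
/* strided_sum(x, n, 8): brings in a new cache line for every element it reads */
```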

Temporal Locality and Reuse

Caches also exploit temporal locality: if you use a piece of data now, you are likely to use it again soon.

When data is loaded into cache, it tends to stay there for some time (until it is evicted). If the same data is accessed again before eviction, the access is served quickly from cache.

Good HPC codes increase data reuse:

  - reordering loops so that data already in cache is used as many times as possible before it is evicted
  - blocking (tiling) loops so that the working set of each block fits in cache
  - fusing loops that traverse the same data, so each element is loaded once instead of several times (see the sketch below)
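
A sketch of the loop-fusion case, with illustrative function names:

```c
/* In two_passes, each y[i] written by the first loop has usually been evicted
   (for large n) before the second loop reloads it. In fused_pass, each y[i] is
   produced and finished while it is still in cache or a register. */
void two_passes(double *y, const double *x, double a, double b, long n) {
    for (long i = 0; i < n; i++) y[i] = a * x[i];      /* pass 1: streams x and y  */
    for (long i = 0; i < n; i++) y[i] = y[i] + b;      /* pass 2: reloads all of y */
}

void fused_pass(double *y, const double *x, double a, double b, long n) {
    for (long i = 0; i < n; i++) y[i] = a * x[i] + b;  /* each y[i] used immediately */
}
```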

Cache Hits, Misses, and Penalties

Each memory access falls into one of two categories at each cache level:

  - Hit: the data is found at that level and returned quickly.
  - Miss: the data is not there, and the request is passed to the next level down.

A miss at L1 may hit in L2, or miss again and be found in L3, and so on.

The miss penalty is the extra time required to fetch data from the next level. When a miss penalty is high and frequent, your code becomes memory-bound instead of compute-bound.
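
A common back-of-the-envelope model from computer-architecture texts (the numbers below are illustrative, not from any particular CPU) is:

average access time ≈ hit time + miss rate × miss penalty

For example, with a 4-cycle hit time, a 5% miss rate, and a 100-cycle penalty, the average access costs about 4 + 0.05 × 100 = 9 cycles, more than double the hit time even though 95% of accesses hit.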

Common miss types (conceptually):

  - Compulsory (cold) misses: the very first access to a cache line can never hit.
  - Capacity misses: the working set is larger than the cache, so lines are evicted before they can be reused.
  - Conflict misses: too many addresses map to the same cache set and evict each other, even though the cache as a whole is not full.

Cache Mapping and Associativity (Conceptual)

A cache must decide where in the cache a given memory address can be stored. The main schemes:

  - Direct-mapped: each address can go in exactly one location.
  - Set-associative (N-way): each address maps to one set and may occupy any of that set's N ways; most real CPU caches work this way.
  - Fully associative: an address can go anywhere; flexible but expensive, so it is used only for small structures.

You usually don’t control associativity directly in code, but access patterns can create or avoid conflict misses: highly regular strides, especially large powers of two, can map many accesses to the same few sets, which then keep evicting each other even though most of the cache is unused.
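
One illustrative sketch of such a pattern (the array size and layout are made up for the example):

```c
#define N 4096                       /* each row is 4096 doubles = 32 KiB */
static double a[N][N];               /* 128 MiB in total, for illustration only */

/* Consecutive accesses in this column walk are exactly 32 KiB apart.
   On caches that select the set from low address bits, many of these
   addresses land in the same few sets and keep evicting one another. */
double column_sum(int j) {
    double s = 0.0;
    for (int i = 0; i < N; i++)      /* stride of N * sizeof(double) bytes */
        s += a[i][j];
    return s;
}
```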

Inclusive vs Exclusive Caches

Relationship between levels matters:

  - Inclusive: data held in L1/L2 is also kept in the last-level cache, so the effective capacity is roughly that of the largest level.
  - Exclusive: a cache line lives in only one level at a time, so the capacities of the levels add up.
  - Non-inclusive designs fall somewhere in between.

On many mainstream x86 systems, last-level cache tends to be inclusive or partially inclusive; some other architectures use exclusive designs.
As an HPC programmer, you generally don’t choose this, but it affects how large a “working set” you can expect to fit efficiently.

Cache Coherence (High-Level View)

On multicore CPUs, each core has its own private caches (L1 and often L2). If two cores work on shared data, their caches must stay logically consistent. This is maintained by a cache-coherence protocol (e.g., MESI and variants).

Key practical implications for HPC:

  - Writes to data shared between cores generate coherence traffic, so frequent updates to shared variables can be surprisingly expensive.
  - False sharing: when two threads repeatedly write to different variables that happen to sit in the same cache line, that line bounces between cores even though no data is logically shared (see the sketch below).

Detailed thread and synchronization behavior is covered in shared-memory programming topics; here, it’s enough to recognize that cache coherence exists and can impact performance.
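
A rough sketch of the false-sharing case and a padding-based fix (the struct layout and the 64-byte line size are assumptions, not a portable recipe):

```c
#include <stdint.h>

#define LINE_SIZE 64      /* assumed cache-line size */
#define NTHREADS   8

/* Prone to false sharing: all per-thread counters are packed into one or two
   cache lines, so concurrent increments by different threads make those lines
   bounce between cores. */
int64_t packed_counters[NTHREADS];

/* Padded so that each counter occupies its own cache line; threads no longer
   invalidate each other's copies when they update their own counter. */
struct padded_counter {
    int64_t value;
    char    pad[LINE_SIZE - sizeof(int64_t)];
};
struct padded_counter padded_counters[NTHREADS];
```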

Typical Cache Sizes and Latencies (Order of Magnitude)

Exact numbers differ by CPU generation, but the pattern is stable:

  - L1: roughly 32-64 KB per core, a few cycles (on the order of 1 ns)
  - L2: roughly 256 KB to a few MB per core, on the order of 10-15 cycles
  - L3: several MB to tens of MB, shared, a few tens of cycles
  - Main memory: many GB, hundreds of cycles (roughly 60-100 ns)

For HPC, this means:

  - a working set that fits in L1 or L2 can be processed at close to peak speed
  - once the working set spills out of the last-level cache, every pass over the data pays main-memory latency and bandwidth, and performance often drops sharply

Cache-Friendly vs Cache-Unfriendly Access Patterns

Simple, abstract examples:

  - Cache-friendly: sequential traversal of an array; traversing a row-major 2D array row by row; reusing a small block of data many times before moving on.
  - Cache-unfriendly: large or irregular strides; pointer chasing through scattered nodes; traversing a row-major 2D array column by column.

In HPC, you often reorganize data structures and loops specifically to improve cache locality (e.g., changing array order, loop order, or using blocking).
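
As one sketch of blocking, the following transpose (sizes and block size are illustrative, not tuned) reorganizes the loops so that small tiles of both arrays stay in cache while they are reused:

```c
#define N     4096
#define BLOCK 64          /* tile size; N is divisible by BLOCK in this sketch */

/* Naive: b is written column-wise with a stride of N doubles, so almost every
   store touches a different cache line. */
void transpose_naive(double *a, double *b) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[(long)j * N + i] = a[(long)i * N + j];
}

/* Blocked: work proceeds tile by tile, so the BLOCK x BLOCK pieces of a and b
   currently being touched fit in cache and their lines are fully reused. */
void transpose_blocked(double *a, double *b) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    b[(long)j * N + i] = a[(long)i * N + j];
}
```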

Measuring and Observing Cache Behavior

Many performance tools provide cache-related metrics such as:

  - counts of cache references and cache misses at each level
  - miss rates (misses per access or per instruction)
  - the fraction of cycles the CPU spends stalled waiting for memory

While details of tools belong elsewhere, it’s important to know that these metrics exist: a high cache-miss rate is usually the clearest sign that a code is memory-bound rather than compute-bound, and measuring it before and after a change is far more reliable than guessing whether an optimization improved locality.

Summary: What to Remember About Cache for HPC

  - Cache is small, fast memory between the CPU and RAM; many performance problems are really data-movement problems.
  - Data moves in cache lines (typically 64 bytes), so contiguous, predictable access patterns get more useful data per fetch.
  - Reusing data while it is still in cache (temporal locality) matters as much as accessing it contiguously (spatial locality).
  - Frequent cache misses make a code memory-bound: the CPU waits on data instead of computing.
  - On multicore CPUs, coherence traffic and false sharing can quietly ruin the performance of shared data structures.
  - Measure cache behavior with performance tools rather than guessing.
