
Memory hierarchy

What the Memory Hierarchy Is About

The memory hierarchy is the layered structure of storage components in a computer system, arranged by speed, cost, and capacity. In HPC, understanding this hierarchy is crucial because the time spent moving data between levels often dominates the time spent computing on it.

At a high level, you have small, very fast storage close to the CPU (registers and caches), larger but slower main memory, and very large but far slower non-volatile storage.

This chapter focuses on how these layers relate to each other and why this matters for HPC performance, leaving the detailed internals of each level to its own subsection.

Key Trade-Offs: Speed, Capacity, Cost, and Distance

The memory hierarchy exists because no single kind of memory can simultaneously be fast, large, and cheap.

Designers therefore combine multiple levels with different characteristics: small, fast, expensive memory sits close to the CPU, and progressively larger, slower, cheaper memory sits farther away.

Conceptually, each faster level acts as a staging area for the larger, slower level beneath it. As you move away from the CPU, access latency grows, bandwidth drops, capacity increases, and cost per byte falls.

Typical Memory Hierarchy Levels

A simplified CPU-centric view looks like:

  1. Registers (closest, fastest, smallest)
  2. L1 cache (small, very fast)
  3. L2 cache (bigger, slightly slower)
  4. L3 cache (big shared cache on many CPUs)
  5. Main memory (RAM) (much bigger, slower)
  6. Non-volatile storage (SSD/HDD; typically considered in I/O chapters)

You’ll see some variations, but this pattern—multiple cache levels plus RAM—is standard in modern HPC systems.
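
To make these levels tangible, the sketch below (a rough microbenchmark; the buffer sizes and pass counts are arbitrary assumptions) times strided sweeps over progressively larger buffers. On typical hardware, the time per access jumps as the working set outgrows L1, then L2, then L3. Compile with optimizations (e.g. `cc -O2`) and treat the numbers as indicative only.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sweep buffers of growing size; average time per access rises as
   the working set falls out of L1, then L2, then L3, then to RAM. */
int main(void) {
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 4) {
        size_t n = kb * 1024 / sizeof(long);
        long *buf = malloc(n * sizeof(long));
        for (size_t i = 0; i < n; i++) buf[i] = 1;

        long sum = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int pass = 0; pass < 100; pass++)
            for (size_t i = 0; i < n; i += 8)  /* 8 longs ~ one 64-byte line (assumed) */
                sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%6zu KiB: %.2f ns/access (sum=%ld)\n",
               kb, ns / (100.0 * (n / 8)), sum);
        free(buf);
    }
    return 0;
}
```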

The Memory Wall

Historically, CPU compute performance has improved faster than memory latency and bandwidth. This leads to the memory wall: a growing gap between how quickly cores can process data and how quickly memory can deliver it.

In practice this means many codes spend most of their time waiting for data rather than computing, so faster or more numerous cores help little unless data movement improves too.

For HPC, managing where your data lives and how it moves through the hierarchy can matter more than the number of cores or clock speed alone.

Temporal and Spatial Locality

The hierarchy works well because many programs exhibit locality: temporal locality (data used recently is likely to be used again soon) and spatial locality (data near recently accessed data is likely to be accessed next).

Caches and other parts of the hierarchy are designed to exploit this: they keep recently used data in fast storage and fetch memory in whole cache lines, so neighboring data arrives along with the data you asked for.

For HPC performance, this means loop order and data layout matter enormously: traversing arrays contiguously and reusing data while it is still cached can speed a code up dramatically, as the sketch below illustrates.
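
A classic demonstration in C, where 2-D arrays are stored row-major: the two loop nests below compute the same sum, but the first walks memory contiguously while the second strides a full row per access. The matrix size `N` is an arbitrary assumption; timing the loops separately typically shows a large gap.

```c
#include <stdio.h>

#define N 2048
static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Cache-friendly: row-major traversal matches the layout, so
       every byte of each fetched cache line is used. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Cache-hostile: column-major traversal jumps N*8 bytes per
       access, wasting most of every cache line fetched. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```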

Latency vs Bandwidth

Memory has two distinct performance aspects: latency, the time for a single access to complete, and bandwidth, the volume of data that can be moved per unit time.

Different levels in the hierarchy balance these differently: caches provide low latency over small capacities, while main memory provides large capacity and decent bandwidth at much higher latency.

HPC codes can be latency-bound (dominated by dependent or irregular accesses, as in pointer chasing) or bandwidth-bound (dominated by streaming through large arrays).

Optimizations often target whichever is the main bottleneck; the sketch below contrasts the two patterns.
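
The two illustrative kernels below (hypothetical helper functions, not from any library) show the contrast. In `chase`, each load depends on the previous one, so performance is set by latency; the `next` array is assumed to hold a random cycle of indices. In `stream_sum`, the loads are independent, so the hardware can overlap them and performance is set by bandwidth.

```c
#include <stddef.h>

/* Latency-bound: serialized dependent loads; the CPU waits out the
   full access latency at every step. */
long chase(const long *next, size_t steps) {
    long i = 0;
    while (steps--) i = next[i];
    return i;
}

/* Bandwidth-bound: independent sequential loads can be prefetched
   and overlapped; the bytes/s the memory system delivers is the limit. */
double stream_sum(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}
```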

Effective vs Theoretical Bandwidth

Theoretical peak bandwidth is given by:

$$
\text{Peak bandwidth} = \text{Bus width} \times \text{Transfer rate} \times \text{Number of channels}
$$
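
As a worked example (the configuration is an illustrative assumption, not a specific product): a DDR4-3200 channel performs $3.2 \times 10^{9}$ transfers per second over a 64-bit (8-byte) bus, so a node with 8 such channels has

$$
8\,\text{B} \times 3.2 \times 10^{9}\,\text{/s} \times 8 = 204.8\ \text{GB/s}
$$

of theoretical peak bandwidth.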

In practice, your effective bandwidth is lower than this peak due to protocol and refresh overheads, non-ideal access patterns, the mix of reads and writes, and contention between cores.

HPC benchmarks (e.g. STREAM) measure sustainable memory bandwidth and give a more realistic picture of what your code can achieve.
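
For orientation, the heart of a STREAM-style measurement is a simple vector kernel such as the triad below; sustained bandwidth is then bytes moved divided by elapsed time. This is only a sketch of the idea, not the official benchmark, which adds timing, repetition, and result validation.

```c
#include <stddef.h>

/* STREAM-style triad: a[i] = b[i] + q * c[i]. Each iteration moves
   24 bytes (two loads, one store) for 2 FLOPs, so the loop is
   heavily bandwidth-bound. */
void triad(double *a, const double *b, const double *c,
           double q, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```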

Multi-Level Caches and Sharing

Modern CPUs often have private L1 and L2 caches per core plus a larger L3 cache shared by all cores on a socket.

Implications for HPC: threads that work on shared data can benefit from shared cache levels, while threads with disjoint working sets compete for the same capacity and can evict each other's data; writes from different cores to the same cache line trigger costly coherence traffic, as the sketch below shows.

Knowing which data is truly shared and how it’s laid out in memory affects how efficiently the hierarchy and sharing mechanisms work.
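
One sharing pitfall worth seeing concretely is false sharing: threads writing different variables that happen to share a cache line force that line to ping-pong between cores. A minimal OpenMP sketch (the 64-byte line size is a common but not universal assumption, and the function is purely illustrative):

```c
#include <omp.h>

/* Pad each counter to a full (assumed 64-byte) cache line so that
   threads incrementing "their own" counter do not contend for the
   same line. Without the padding, the updates would still be correct
   but dramatically slower due to coherence traffic. */
struct padded { long value; char pad[64 - sizeof(long)]; };

void count_events(struct padded *counts, int nthreads, long iters) {
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < iters; i++)
            counts[t].value++;  /* each thread touches its own line */
    }
}
```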

Bandwidth vs Core Count

As core counts per socket grow, per-core memory bandwidth can drop if total memory bandwidth does not increase proportionally: every core draws on the same memory controllers and channels.

The result is that bandwidth-bound codes often stop scaling well before all cores are busy, because the shared memory system saturates first.

This is a direct consequence of how multiple cores share the memory hierarchy.
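
You can observe the saturation directly by running a bandwidth-bound kernel at increasing thread counts; speedup typically flattens well before all cores are in use. A rough OpenMP sketch (array size and thread sweep are arbitrary assumptions):

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t n = 1ul << 27;                 /* ~1 GiB of doubles */
    double *x = malloc(n * sizeof *x);
    for (size_t i = 0; i < n; i++) x[i] = 1.0;

    /* Once the memory controllers saturate, doubling the thread
       count stops reducing the elapsed time. */
    for (int t = 1; t <= omp_get_max_threads(); t *= 2) {
        double sum = 0.0, t0 = omp_get_wtime();
        #pragma omp parallel for num_threads(t) reduction(+:sum)
        for (size_t i = 0; i < n; i++) sum += x[i];
        printf("%2d threads: %.3f s (sum=%.0f)\n",
               t, omp_get_wtime() - t0, sum);
    }
    free(x);
    return 0;
}
```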

Implications for Parallel HPC Codes

The memory hierarchy affects parallel performance at multiple levels: within a core (registers and L1), between cores on a socket (shared caches and memory controllers), and across sockets and nodes (NUMA regions and the network).

Typical high-level implications: keep each thread's working set small and cache-friendly, lay data out contiguously, avoid false sharing, and place data near the threads that use it, as the first-touch sketch below shows.
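
On multi-socket (NUMA) nodes, one concrete practice is first-touch placement: pages are typically allocated on the socket of the thread that first writes them, so initializing data with the same parallel schedule as the compute loop keeps accesses local. A hedged sketch (the helper name is hypothetical):

```c
#include <omp.h>
#include <stdlib.h>

/* First-touch initialization: writing each element in parallel,
   with the same static schedule the compute loops will use, places
   each thread's pages in its local NUMA memory. */
double *numa_friendly_alloc(size_t n) {
    double *x = malloc(n * sizeof *x);
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        x[i] = 0.0;             /* first write decides placement */
    return x;
}
```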

Roofline Perspective

A common way to relate computation and the memory hierarchy is the roofline model, which plots attainable performance (FLOP/s) against arithmetic intensity $I$, the number of FLOPs a kernel performs per byte moved to or from memory.

Given a memory bandwidth $B$ (bytes/s) and peak compute $P$ (FLOP/s), the memory-imposed ceiling and the resulting achievable performance are:

$$
P_{\text{mem}} = B \times I
$$

$$
P_{\text{achievable}} = \min(P,\, B \times I)
$$

This shows directly how the memory hierarchy (through $B$) sets a ceiling on performance for memory-intensive codes.
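
A worked example with assumed machine numbers: a triad-style kernel does 2 FLOPs per 24 bytes moved, so $I \approx 0.083$ FLOP/B. On a machine with $B = 200$ GB/s and $P = 3$ TFLOP/s,

$$
P_{\text{achievable}} = \min\!\left(3 \times 10^{12},\ 200 \times 10^{9} \times 0.083\right) \approx 16.7\ \text{GFLOP/s}
$$

i.e. the kernel is capped at well under 1% of peak compute by memory bandwidth alone.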

Practical Takeaways for Beginners

At this stage, you don’t need to memorize hardware details, but keep these ideas in mind:

  1. Moving data is often more expensive than computing on it.
  2. Access memory contiguously and reuse data while it is still in cache.
  3. Memory bandwidth is a finite resource shared by all cores.
  4. Sustained, measured bandwidth matters more than theoretical peaks.

The following chapters on registers, caches, and main memory look at specific levels in more detail and connect them to concrete programming practices.
