Why Cache Optimization Matters in HPC
Cache optimization is about structuring your data and computation so that:
- Data is used while it is still “hot” in cache.
- You minimize expensive transfers between memory levels (L1/L2/L3 ↔ RAM).
- You exploit spatial and temporal locality.
For many HPC codes, you can get large speedups (often larger than from modest parallelization changes) purely by improving cache behavior. This is especially important because:
- CPU–RAM bandwidth and latency are often the bottleneck.
- Modern CPUs have deep cache hierarchies and hardware prefetchers that reward regular access patterns.
This chapter assumes you already understand the basic memory hierarchy and are now focusing on how to write and organize code for good cache usage.
Locality: Spatial and Temporal
Cache-friendly code exploits two kinds of locality:
- Spatial locality: When you access an element, there is a high chance you will soon access neighboring elements in memory (same cache line or nearby).
- Temporal locality: When you access an element, you will access the same element again soon.
Effective cache optimization tries to:
- Arrange data so consecutive operations touch nearby memory (spatial).
- Reuse data as much as possible before it is evicted from cache (temporal).
Examples:
- Iterating over an array linearly (i = 0, 1, 2, …) has good spatial locality.
- Keeping a small working set in inner loops and reusing it repeatedly has good temporal locality.
Data Layout and Access Patterns
Array Traversal Order
In C/C++ and Fortran, multidimensional arrays are stored differently:
- C/C++: row-major (last index varies fastest in memory).
- Fortran: column-major (first index varies fastest).
To get good spatial locality, your innermost loop should iterate over the dimension that is contiguous in memory.
Example (C, A[N][M] is row-major):

```c
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {   // j varies fastest: good
        sum += A[i][j];
    }
}
```

Bad order:

```c
for (int j = 0; j < M; j++) {
    for (int i = 0; i < N; i++) {   // i varies fastest: poor locality
        sum += A[i][j];
    }
}
```

This “wrong” pattern forces many cache misses because each step jumps to a distant location in memory.
In Fortran (column-major), the roles are reversed: you generally want the first index in the innermost loop.
Array of Structures vs Structure of Arrays
For data with multiple fields, the layout influences cache behavior:
- Array of Structures (AoS):

```c
typedef struct {
    double x, y, z;
} Point;

Point pts[N];
```

Good if you always need x, y, z together, but bad if you only need x over many elements.
- Structure of Arrays (SoA):

```c
typedef struct {
    double *x;
    double *y;
    double *z;
} Points;

Points pts;
pts.x = malloc(N * sizeof(double));
pts.y = malloc(N * sizeof(double));
pts.z = malloc(N * sizeof(double));
```

Good for operations that touch only one or two fields at a time, allowing better spatial locality and vectorization.
In HPC, SoA (or hybrid layouts) is common when fields are processed separately (stencils, particle codes, etc.).
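As a minimal sketch of the difference (the shift_x_* kernels below are illustrative and reuse the Point and Points layouts defined above): updating only the x field uses every byte of each loaded cache line in the SoA layout, while the AoS layout also drags the unused y and z fields through the cache.

```c
// AoS: consecutive x values are sizeof(Point) = 24 bytes apart,
// so each 64-byte cache line carries mostly unused y and z data.
void shift_x_aos(Point *pts, int n, double dx) {
    for (int i = 0; i < n; i++)
        pts[i].x += dx;
}

// SoA: the x values are contiguous doubles (stride-1),
// so every loaded cache line is fully used and the loop vectorizes easily.
void shift_x_soa(Points *pts, int n, double dx) {
    for (int i = 0; i < n; i++)
        pts->x[i] += dx;
}
```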
Contiguity and Stride
The stride of an access pattern is the step in memory between consecutive accesses.
- Stride-1 (contiguous) is ideal.
- Large, irregular, or power-of-two strides often hurt performance.
Example (C, stride-1 vs stride-k):

```c
// Stride-1
for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}

// Stride-k
for (int i = 0; i < N; i++) {
    a[i*k] = b[i*k] + c[i*k];
}
```

The stride-k version touches many more cache lines for the same number of elements and defeats hardware prefetching.
Loop Transformations for Better Cache Use
Loop Interchange
Loop interchange changes the nesting order of loops to improve memory access patterns.
Example (C, 2D array, row-major):

```c
// Original: poor for row-major
for (int j = 0; j < M; j++) {
    for (int i = 0; i < N; i++) {
        A[i][j] = ...
    }
}

// Interchanged: good for row-major
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        A[i][j] = ...
    }
}
```

You must ensure interchange does not change the semantics (no loop-carried dependencies that rely on order).
Loop Tiling / Blocking
Loop tiling (also called blocking) is a fundamental HPC technique: it breaks loops into smaller blocks that fit into cache, increasing temporal locality.
Consider a naive matrix multiplication (C-like pseudocode):
$$
C_{ij} = \sum_k A_{ik} B_{kj}
$$
Naive triple loop:

```c
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
```
This touches large parts of A and B repeatedly, causing many cache misses as N grows.
Tiled version:

```c
for (int ii = 0; ii < N; ii += Ti)
    for (int jj = 0; jj < N; jj += Tj)
        for (int kk = 0; kk < N; kk += Tk)
            for (int i = ii; i < min(ii+Ti, N); i++)
                for (int j = jj; j < min(jj+Tj, N); j++)
                    for (int k = kk; k < min(kk+Tk, N); k++)
                        C[i][j] += A[i][k] * B[k][j];
```
By carefully choosing tile sizes Ti, Tj, Tk to fit into cache, elements of A, B, and C are reused many times while they are still cached.
General advice:
- Tile so the working set of the innermost loops fits into L1 or L2 cache.
- Consider both data size and cache line size when picking tile sizes.
- Test and tune; optimal tile sizes are architecture-dependent.
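As a rough worked example of this sizing advice (assuming, purely for illustration, a 32 KiB L1 data cache, double-precision data, and square tiles, so that one T×T tile each of A, B, and C must fit at once):

$$
3\,T^{2} \times 8 \ \text{bytes} \;\le\; 32\,768 \ \text{bytes} \quad\Longrightarrow\quad T \lesssim 36,
$$

so something like T = 32 is a plausible starting point on such a machine; the real optimum still has to be measured, and targeting L2 instead changes the numbers.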
Loop Fusion and Fission
- Loop fusion: combine multiple loops that iterate over the same range, improving temporal locality.
- Loop fission (distribution): split one large loop into smaller loops if it allows better caching or vectorization.
Example (fusion):

```c
// Before: two passes over array
for (int i = 0; i < N; i++)
    a[i] = f(x[i]);
for (int i = 0; i < N; i++)
    b[i] = g(x[i]);

// After: fused
for (int i = 0; i < N; i++) {
    a[i] = f(x[i]);
    b[i] = g(x[i]);
}
```
Fusion improves temporal locality for x[i]. However, if f and g are complex and increase register pressure, fusion can hurt performance; measure both variants.
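Conversely, a minimal fission sketch (the helper functions f and g and the four large arrays are assumptions for illustration): when one loop streams through several unrelated large arrays, splitting it reduces the number of concurrent memory streams and the register pressure per iteration, which can sometimes help caching and vectorization. As with fusion, measure both variants.

```c
// Before: each iteration touches four large arrays (a, x, b, y)
for (int i = 0; i < N; i++) {
    a[i] = f(x[i]);
    b[i] = g(y[i]);
}

// After fission: each loop streams through only two arrays at a time
for (int i = 0; i < N; i++)
    a[i] = f(x[i]);
for (int i = 0; i < N; i++)
    b[i] = g(y[i]);
```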
Blocking in Higher Dimensions (Stencils)
In stencil codes (e.g., 2D/3D PDE solvers), each output element depends on neighboring points. Tiling in 2D or 3D (sometimes combined with time blocking) keeps neighborhoods in cache.
Example idea for a 2D stencil (pseudocode, row-major, interior points only so that the i±1 and j±1 neighbors stay in bounds):

```c
for (int ii = 1; ii < N-1; ii += Ti)
    for (int jj = 1; jj < M-1; jj += Tj)
        for (int i = ii; i < min(ii+Ti, N-1); i++)
            for (int j = jj; j < min(jj+Tj, M-1); j++)
                out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j]
                                  + in[i][j-1] + in[i][j+1]);
```
Choosing Ti and Tj to keep the local tile and its halo in cache significantly reduces memory traffic.
Cache-Friendly Data Structures
Compact and Contiguous Layouts
Prefer data structures that:
- Use few pointers.
- Keep frequently accessed data in contiguous arrays.
Linked lists, trees, and general pointer-based structures tend to:
- Scatter nodes across memory.
- Defeat spatial locality and prefetching.
Alternatives:
- Use arrays of indices instead of pointers (e.g., CSR/CSC formats for sparse matrices).
- Use flat arrays with computed offsets instead of nested pointer structures.
Sparse Matrix Layouts
In HPC, sparse matrices are often stored in formats like CSR (Compressed Sparse Row):
- Three arrays: values, col_index, and row_ptr.
- Rows are contiguous, improving locality when traversing row by row.
Format choice affects cache behavior:
- Row-major traversals → CSR.
- Column-major-like traversals → CSC.
Mixed patterns may require more advanced formats or reordering.
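A minimal CSR sparse matrix-vector product sketch (the function and array names are illustrative; row_ptr has n+1 entries): traversing row by row reads values and col_index with stride-1, while the accesses to the input vector x are indirect and may still miss in cache.

```c
// y = A * x for an n-row matrix in CSR format
void spmv_csr(int n, const int *row_ptr, const int *col_index,
              const double *values, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += values[k] * x[col_index[k]];  // values/col_index: contiguous; x: gather
        y[i] = sum;
    }
}
```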
Data Alignment
Although cache lines are fixed-size, misalignment can:
- Cause loads/stores to cross cache line boundaries.
- Hurt vectorization.
Practices:
- Align arrays to cache line or vector boundaries (e.g., 64 bytes).
- Use compiler-specific attributes or allocation functions (e.g., posix_memalign, _mm_malloc, or alignas(64) in C++).
Alignment is especially important when combined with SIMD/vectorization.
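A minimal allocation sketch (assuming a POSIX system; the 64-byte figure matches common cache line sizes and AVX-512 vector widths, but check your target):

```c
#include <stdlib.h>

// Returns an array of n doubles aligned to a 64-byte boundary, or NULL on failure.
double *alloc_aligned_doubles(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, 64, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;  // release with free() as usual
}
```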
Managing Working Sets
Working Set and Cache Capacity
The working set is the data that a given part of your computation needs to keep “live” at the same time.
For good cache performance:
- The working set of hot inner loops should be smaller than the capacity of a suitable cache level (e.g., L1 or L2).
- Tiling is how you reduce the working set size without changing the problem.
If your working set is larger than cache:
- Data will be constantly evicted and reloaded.
- Performance becomes bound by memory bandwidth and latency.
Cache-Friendly Algorithm Design
Some algorithms are inherently more cache-friendly than others for the same mathematical task.
Examples:
- Blocked vs non-blocked matrix algorithms (e.g., LU/QR factorizations).
- Recursive algorithms that operate on contiguous subarrays (divide and conquer) often have better locality than simple iterative ones.
In practice, many HPC libraries (BLAS, LAPACK, etc.) implement blocked algorithms specifically to improve cache reuse.
Avoiding Cache Conflicts and Thrashing
Conflict Misses and Associativity
Caches are organized into sets, and each memory address maps to a particular set. If too many actively used data blocks map to the same set, you get conflict misses.
Symptoms:
- Performance drops dramatically when array sizes hit certain multiples (e.g., powers of two).
- Changing array sizes or padding fixes it without other changes.
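As a concrete (and hedged) illustration, assume a 32 KiB, 8-way set-associative L1 data cache with 64-byte lines, i.e., 64 sets that repeat every 64 × 64 = 4096 bytes. A matrix row of 1024 doubles occupies 8192 bytes, an exact multiple of that period:

$$
1024 \times 8 \ \text{bytes} \bmod 4096 \ \text{bytes} = 0,
$$

so element j of every row maps to the same cache set, and walking down a column has only 8 ways available; the lines evict each other even though the rest of the cache sits idle. Padding each row to 1025 doubles breaks this pattern.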
Padding and Alignment to Reduce Conflicts
A common HPC trick is to use padding:
- Add extra elements to rows or arrays to change their alignment and mapping to cache sets.
- Break harmful patterns where different arrays or rows always collide in cache.
Example (C, 2D array flattened with padding):

```c
int N;  // logical size
int P;  // padding, e.g., 1 or more
double *A = malloc((size_t)(N + P) * N * sizeof(double));
// Index as A[i*(N+P) + j]; j < N
```

Or pad leading dimensions in numerical libraries so successive rows don’t map to the exact same cache sets.
Choosing padding:
- Often small (1–16 elements) is enough.
- Must be guided by measurement.
Measuring and Diagnosing Cache Behavior
Hardware Performance Counters
Use profiling tools that expose hardware counters such as:
- L1/L2/L3 hit/miss counts.
- Cache miss rates.
- Memory bandwidth usage.
Common tools (depending on system):
- perf (Linux).
- Vendor tools (e.g., Intel VTune, AMD uProf, NVIDIA Nsight for GPU-side caches).
- PAPI-based tools or wrappers.
Indicators of cache problems:
- High L1/L2/L3 miss rates.
- Performance dominated by memory stalls.
- Low achieved FLOP/s relative to theoretical peak but high memory bandwidth usage (memory-bound behavior).
Microbenchmarks and Simple Tests
To test cache effects of an access pattern:
- Write small kernels that operate on arrays of varying size.
- Increase the size and measure runtimes; look for “knees” where performance drops as you exceed L1, then L2, etc.
- Compare row-major vs column-major traversals.
These experiments help you understand how a given architecture behaves and guide your choice of blocking and data layout.
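A minimal sketch of such a microbenchmark (sizes, the repetition count, and the timing details are assumptions to adapt; an aggressive compiler may still need convincing not to collapse the repeated passes): it sums arrays of increasing size and prints the time per element, which typically rises in steps as the working set outgrows L1, then L2, then L3.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    // Sweep array sizes from 8 KiB to 128 MiB of doubles.
    for (size_t n = 1024; n <= (size_t)1 << 24; n <<= 1) {
        double *a = malloc(n * sizeof(double));
        if (!a) return 1;
        for (size_t i = 0; i < n; i++) a[i] = 1.0;

        size_t reps = ((size_t)1 << 26) / n;   // keep total work roughly constant
        if (reps == 0) reps = 1;

        struct timespec t0, t1;
        double sum = 0.0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                sum += a[i];                   // stride-1 read of the whole array
        clock_gettime(CLOCK_MONOTONIC, &t1);

        volatile double sink = sum;            // keep the result "used"
        (void)sink;

        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("%10zu KiB  %8.3f ns/element\n",
               n * sizeof(double) / 1024, 1e9 * secs / ((double)n * reps));
        free(a);
    }
    return 0;
}
```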
Interaction with Compilers and Vectorization
Cache optimization often synergizes with:
- Vectorization: contiguous, stride-1 patterns are easier to auto-vectorize, and vectorized code consumes data in cache-line-friendly chunks.
- Loop transformations by the compiler: many compilers can do loop interchange, unrolling, or blocking themselves, but only for simple cases and when allowed by dependencies.
Hints to compilers:
- Use restrict (C99), or compiler equivalents such as __restrict in C++, on pointers when valid to indicate no aliasing (see the sketch after this list).
- Keep loop bodies simple and predictable.
- Avoid unnecessary function calls inside tight loops (or inline them).
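A minimal sketch of the restrict hint mentioned above (the kernel itself is an illustrative assumption): promising the compiler that the arrays do not overlap removes conservative reloads and makes the stride-1 loop straightforward to vectorize.

```c
#include <stddef.h>

// With restrict, the compiler may assume a, b, and c never alias,
// so it can keep values in registers and emit vector loads/stores.
void add_scaled(double *restrict a, const double *restrict b,
                const double *restrict c, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```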
Even with aggressive compiler optimizations, manual cache-aware restructuring of algorithms and data layout is often required for best HPC performance.
Practical Guidelines and Workflow
- Identify memory-bound kernels:
- Use a profiler to find hotspots where CPU is stalled on memory.
- Inspect data access patterns:
- Check for strides > 1, non-contiguous data structures, and pointer chasing.
- Apply simple fixes first:
- Fix loop orders to match memory layout.
- Convert obvious AoS to SoA when it helps.
- Align data and remove unnecessary indirections.
- Introduce tiling/blocking:
- Block in space (and possibly time) so the working set fits into cache.
- Tune tile sizes empirically on target hardware.
- Measure and iterate:
- Use hardware counters and wall-clock time.
- Compare versions and keep the simplest change that gives the desired gain.
- Balance readability and performance:
- Isolate performance-critical kernels where more complex, cache-optimized code is justified.
- Document assumptions (e.g., data layout, sizes) clearly.
By systematically applying these techniques, many HPC codes can achieve substantial speedups purely from better cache utilization, often without changing algorithms or parallelization strategies.