
Cache optimization

Why Cache Optimization Matters in HPC

Cache optimization is about structuring your data and computation so that:

• data that is used together is stored close together in memory, and
• data that is reused is accessed again while it is still in cache.

For many HPC codes, you can get large speedups (often larger than from modest parallelization changes) purely by improving cache behavior. This is especially important because:

• compute throughput has grown far faster than memory latency and bandwidth, so many HPC kernels are memory-bound rather than compute-bound;
• a main-memory access costs hundreds of cycles, while an L1 cache hit costs only a few.

This chapter assumes you already understand the basic memory hierarchy and are now focusing on how to write and organize code for good cache usage.

Locality: Spatial and Temporal

Cache-friendly code exploits two kinds of locality:

• Spatial locality: after a memory location is accessed, nearby locations (especially those on the same cache line) are likely to be accessed soon.
• Temporal locality: a location that has just been accessed is likely to be accessed again soon.

Effective cache optimization tries to:

• maximize spatial locality by accessing memory contiguously, so every byte of each fetched cache line is used;
• maximize temporal locality by reusing data while it is still cached.

Examples:

• Summing an array front to back exploits spatial locality: each cache line is fetched once and fully consumed.
• Performing all the work on an element before moving to the next exploits temporal locality.
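A minimal illustration (assuming arrays a and b of length n):

double dot = 0.0;
for (int i = 0; i < n; i++)
    dot += a[i] * b[i];
// Spatial locality: a[i] and a[i+1] usually share a cache line.
// Temporal locality: 'dot' is reused every iteration and stays in a register.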

Data Layout and Access Patterns

Array Traversal Order

In C/C++ and Fortran, multidimensional arrays are stored differently:

• C/C++ arrays are row-major: A[i][j] and A[i][j+1] are adjacent in memory.
• Fortran arrays are column-major: A(i,j) and A(i+1,j) are adjacent in memory.

To get good spatial locality, your innermost loop should iterate over the dimension that is contiguous in memory.

Example (C, A[N][M] is row-major):

for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {  // j varies fastest: good
        sum += A[i][j];
    }
}

Bad order:

for (int j = 0; j < M; j++) {
    for (int i = 0; i < N; i++) {  // i varies fastest: poor locality
        sum += A[i][j];
    }
}

This “wrong” pattern forces many cache misses because each step jumps to a distant location in memory.

In Fortran (column-major), the roles are reversed: you generally want the first index in the innermost loop.

Array of Structures vs Structure of Arrays

For data with multiple fields, the layout influences cache behavior.

Array of Structures (AoS):

  typedef struct {
      double x, y, z;
  } Point;
  Point pts[N];

AoS works well if you always need x, y, and z together, but is wasteful when you only need x across many elements: each cache line fetched also drags in unused y and z values.

Structure of Arrays (SoA):

  typedef struct {
      double *x;
      double *y;
      double *z;
  } Points;
  Points pts;
  pts.x = malloc(N * sizeof(double));
  pts.y = malloc(N * sizeof(double));
  pts.z = malloc(N * sizeof(double));

SoA is good for operations that touch only one or two fields at a time, giving better spatial locality and enabling vectorization.

In HPC, SoA (or hybrid layouts) is common when fields are processed separately (stencils, particle codes, etc.).
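For example, a position update that touches only the x field streams a single contiguous array (vel and dt here are assumed for illustration, not part of the layout above):

// SoA: updating x touches one contiguous array; every fetched cache
// line is fully used, and the loop vectorizes naturally.
for (int i = 0; i < N; i++)
    pts.x[i] += dt * vel.x[i];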

Contiguity and Stride

The stride of an access pattern is the step in memory between consecutive accesses.

Example (C, stride-1 vs stride-k):

// Stride-1
for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}
// Stride-k
for (int i = 0; i < N; i++) {
    a[i*k] = b[i*k] + c[i*k];
}

The stride-k version touches many more cache lines for the same number of elements and defeats hardware prefetching.

Loop Transformations for Better Cache Use

Loop Interchange

Loop interchange changes the nesting order of loops to improve memory access patterns.

Example (C, 2D array, row-major):

// Original: poor for row-major
for (int j = 0; j < M; j++) {
    for (int i = 0; i < N; i++) {
        A[i][j] = ...
    }
}
// Interchanged: good for row-major
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        A[i][j] = ...
    }
}

You must ensure interchange does not change the semantics (no loop-carried dependencies that rely on order).
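A minimal sketch of such a dependence (the loop body is illustrative):

// Iteration (i, j) reads the value written by iteration (i-1, j+1):
// a loop-carried dependence with distance vector (1, -1).
for (int i = 1; i < N; i++)
    for (int j = 0; j < M - 1; j++)
        A[i][j] = A[i-1][j+1] + 1.0;

With i outermost, the write at (i-1, j+1) always precedes the read at (i, j). After interchange, iteration (i, j) would run before (i-1, j+1) and read a stale value, changing the result.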

Loop Tiling / Blocking

Loop tiling (also called blocking) is a fundamental HPC technique: it breaks loops into smaller blocks that fit into cache, increasing temporal locality.

Consider a naive matrix multiplication (C-like pseudocode):

$$
C_{ij} = \sum_k A_{ik} B_{kj}
$$

Naive triple loop:

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];

This touches large parts of A and B repeatedly, causing many cache misses as N grows.

Tiled version:

for (int ii = 0; ii < N; ii += Ti)
  for (int jj = 0; jj < N; jj += Tj)
    for (int kk = 0; kk < N; kk += Tk)
      for (int i = ii; i < min(ii+Ti, N); i++)      // min() = integer minimum;
        for (int j = jj; j < min(jj+Tj, N); j++)    // define it as a macro in real C
          for (int k = kk; k < min(kk+Tk, N); k++)
            C[i][j] += A[i][k] * B[k][j];

By carefully choosing tile sizes Ti, Tj, Tk to fit into cache, elements of A, B, and C are reused many times while they are still cached.

General advice:

• Choose tile sizes so the tiles of all arrays touched in the inner loops fit together in the target cache level (see the sizing sketch below).
• Tune tile sizes empirically: the best values depend on cache sizes, associativity, and the kernel.
• For large problems, consider tiling for more than one cache level.
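As a rough sizing sketch (the 256 KiB L2 here is an assumption; substitute your target's cache size): with square tiles $T_i = T_j = T_k = T$ in double precision, the three active tiles occupy about $3T^2 \times 8$ bytes, so

$$
3\,T^2 \cdot 8\,\mathrm{B} \le 256\,\mathrm{KiB} \quad\Rightarrow\quad T \lesssim 104,
$$

and a starting value such as $T = 96$ (a multiple of typical SIMD widths) can then be tuned empirically.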

Loop Fusion and Fission

Loop fusion merges loops that sweep over the same data, so each element is used for several operations while it is still in cache; loop fission (splitting) does the opposite when a single loop touches too many arrays at once.

Example (fusion):

// Before: two passes over array
for (int i = 0; i < N; i++)
    a[i] = f(x[i]);
for (int i = 0; i < N; i++)
    b[i] = g(x[i]);
// After: fused
for (int i = 0; i < N; i++) {
    a[i] = f(x[i]);
    b[i] = g(x[i]);
}

Fusion improves temporal locality for x[i]. However, if f and g are complex and increase register pressure, fusion can hurt performance; measure both variants.
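Fission is the reverse transformation; a sketch, assuming the original loop streams four unrelated arrays:

// Before: one loop streams a, x, b, y simultaneously
for (int i = 0; i < N; i++) {
    a[i] = f(x[i]);
    b[i] = g(y[i]);
}
// After fission: each pass streams only two arrays, reducing the
// number of concurrent access streams and the per-pass working set
for (int i = 0; i < N; i++)
    a[i] = f(x[i]);
for (int i = 0; i < N; i++)
    b[i] = g(y[i]);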

Blocking in Higher Dimensions (Stencils)

In stencil codes (e.g., 2D/3D PDE solvers), each output element depends on neighboring points. Tiling in 2D or 3D (sometimes combined with time blocking) keeps neighborhoods in cache.

Example idea for 2D stencil (pseudocode, row-major):

for (int ii = 1; ii < N-1; ii += Ti)
  for (int jj = 1; jj < M-1; jj += Tj)
    for (int i = ii; i < min(ii+Ti, N-1); i++)      // interior points only, so
      for (int j = jj; j < min(jj+Tj, M-1); j++)    // i±1 and j±1 stay in bounds
        out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j]
                          + in[i][j-1] + in[i][j+1]);

Choosing Ti and Tj to keep the local tile and its halo in cache significantly reduces memory traffic.
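As a rough estimate, a tile of in plus its one-point halo occupies about $(T_i+2)(T_j+2) \times 8$ bytes in double precision, and the corresponding tile of out another $T_i T_j \times 8$ bytes; both together should fit in the target cache level.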

Cache-Friendly Data Structures

Compact and Contiguous Layouts

Prefer data structures that:

• store elements contiguously (arrays, vectors);
• keep frequently accessed ("hot") fields packed together;
• minimize pointer indirection between logically adjacent elements.

Linked lists, trees, and general pointer-based structures tend to:

• scatter nodes across the heap, so logically adjacent elements share no cache lines;
• require pointer chasing, where each access must wait for the previous load;
• defeat hardware prefetchers, which rely on predictable address streams.

Alternatives:

• flat arrays or index-based structures instead of pointer-linked nodes (see the sketch below);
• arena/pool allocation, so related nodes end up close together in memory;
• implicit layouts such as array-based heaps for tree-like data.
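A sketch of an index-linked pool (names are illustrative):

// Nodes live in one contiguous array and link by index rather than
// pointer: allocation order controls memory order, and traversal
// stays within one compact region.
typedef struct {
    double value;
    int    next;      // index of next node in 'pool'; -1 marks the end
} Node;

Node pool[MAX_NODES]; // contiguous backing store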

Sparse Matrix Layouts

In HPC, sparse matrices are often stored in formats like CSR (Compressed Sparse Row), which uses three contiguous arrays:

• val: the nonzero values, stored row by row;
• colidx: the column index of each nonzero;
• rowptr: for each row, the offset of its first nonzero in val.

Format choice affects cache behavior:

• CSR gives contiguous, stride-1 access to val and colidx when traversing by rows (e.g., in sparse matrix–vector products), but indirect, scattered access into the dense vector.
• CSC (Compressed Sparse Column) is the analogue for column-wise traversal.

Mixed patterns may require more advanced formats or reordering.
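For illustration, a sparse matrix–vector product y = A·x in CSR form (array names follow the description above):

// Row-wise CSR traversal: val, colidx, and rowptr are all read with
// unit stride; only the gather x[colidx[k]] is irregular.
for (int i = 0; i < nrows; i++) {
    double sum = 0.0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
        sum += val[k] * x[colidx[k]];
    y[i] = sum;
}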

Data Alignment

Although cache lines are fixed-size, misalignment can:

• make a small object straddle two cache lines, doubling the lines fetched for it;
• force compilers to emit unaligned or split SIMD loads and stores;
• waste part of every fetched line on data you do not use.

Practices:

• allocate hot arrays with cache-line alignment (e.g., aligned_alloc or posix_memalign in C; sketched below);
• use alignas (C11/C++11) for statically declared data;
• pad row lengths so each row starts on an aligned boundary.

Alignment is especially important when combined with SIMD/vectorization.
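A minimal sketch of cache-line-aligned allocation in C11 (the 64-byte line size is an assumption; check your platform):

#include <stdlib.h>

// aligned_alloc requires the total size to be a multiple of the alignment.
size_t n = 1024;
double *a = aligned_alloc(64, n * sizeof(double)); // 64 B = one cache line
/* ... use a ... */
free(a);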

Managing Working Sets

Working Set and Cache Capacity

The working set is the data that must be kept “alive” at once for a part of your computation.

For good cache performance:

• the working set of your innermost loops should fit in L1 or L2;
• data without reuse should be streamed through once rather than revisited.

If your working set is larger than cache:

• restructure with blocking/tiling so each block's working set fits;
• process the data in cache-sized chunks, finishing all work on one chunk before moving to the next (sketched below);
• reorder computations to shorten the time between uses of the same data.
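A sketch of chunked processing (CHUNK, f, and g are placeholders): instead of two full passes over an array larger than cache, do both passes chunk by chunk:

for (size_t base = 0; base < n; base += CHUNK) {
    size_t end = base + CHUNK < n ? base + CHUNK : n;
    for (size_t i = base; i < end; i++)   // pass 1 pulls the chunk into cache
        a[i] = f(a[i]);
    for (size_t i = base; i < end; i++)   // pass 2 reuses it while still cached
        b[i] = g(a[i]);
}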

Cache-Friendly Algorithm Design

Some algorithms are inherently more cache-friendly than others for the same mathematical task.

Examples:

• Blocked matrix multiplication and blocked LU factorization reuse cache-resident tiles, unlike their naive counterparts.
• Cache-oblivious (recursive divide-and-conquer) algorithms achieve good locality at every level of the hierarchy without explicit tile-size tuning.
• A merge-based sort streams data sequentially, while a heap-based sort jumps unpredictably through the array.

In practice, many HPC libraries (BLAS, LAPACK, etc.) implement blocked algorithms specifically to improve cache reuse.

Avoiding Cache Conflicts and Thrashing

Conflict Misses and Associativity

Caches are organized into sets, and each memory address maps to a particular set. If too many actively used data blocks map to the same set, you get conflict misses.

Symptoms:

• sharp slowdowns at power-of-two problem sizes or leading dimensions;
• miss rates far higher than the working-set size alone would predict;
• performance that changes dramatically with small changes in array dimensions.

Padding and Alignment to Reduce Conflicts

A common HPC trick is to use padding: add a few unused elements to each row (the leading dimension) so that successive rows map to different cache sets instead of colliding in the same ones.

Example (C, 2D array flattened with padding):

int N; // logical row length
int P; // padding elements per row, e.g., 1 or more
double *A = malloc((size_t)N * (N + P) * sizeof(double));
// N rows of N+P elements; index as A[i*(N+P) + j] with j < N

Or pad leading dimensions in numerical libraries so successive rows don’t map to the exact same cache sets.

Choosing padding:

• avoid leading dimensions that are large powers of two;
• a small pad, from one element up to one cache line, is usually enough;
• verify with measurements, since the best pad depends on cache geometry (see the worked numbers below).
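To see why powers of two hurt, consider a typical 32 KiB, 8-way L1 cache with 64-byte lines: it has 32768 / 64 / 8 = 64 sets. With rows of 512 doubles (4096 bytes), element j of every row maps to the same set, so walking down a column cycles through only 8 ways and evicts constantly. Padding each row by one cache line (8 doubles) shifts successive rows by one set, spreading the column across all 64 sets.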

Measuring and Diagnosing Cache Behavior

Hardware Performance Counters

Use profiling tools that expose hardware counters such as:

• L1, L2, and last-level cache misses and miss rates;
• cycles stalled waiting for memory;
• TLB misses and memory-bandwidth counters.

Common tools (depending on system):

• perf (Linux) and PAPI;
• LIKWID;
• vendor profilers such as Intel VTune or AMD uProf;
• Cachegrind (Valgrind) for cache simulation.

Indicators of cache problems:

• high miss rates in L2 or the last-level cache;
• a large fraction of cycles stalled on memory;
• achieved memory bandwidth near the machine's peak while arithmetic units sit idle.

Microbenchmarks and Simple Tests

To test cache effects of an access pattern:

• write a microbenchmark that sweeps an array while varying the array size and the access stride;
• time it and compute time per element: transitions appear where the array size crosses each cache capacity;
• repeat with different strides to expose cache-line and prefetcher effects, as in the sketch below.
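A minimal sketch of such a microbenchmark (POSIX clock_gettime; the volatile sink keeps the compiler from deleting the loop):

#include <stddef.h>
#include <time.h>

double sweep(const double *a, size_t n, size_t stride, int reps) {
    struct timespec t0, t1;
    volatile double sink = 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i += stride)
            sink += a[i];     // one access per stride step
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

Plotting the returned time per element against n (and stride) shows steps where the array size crosses the L1, L2, and L3 capacities.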

These experiments help you understand how a given architecture behaves and guide your choice of blocking and data layout.

Interaction with Compilers and Vectorization

Cache optimization often synergizes with:

• vectorization: contiguous, aligned, unit-stride data is exactly what SIMD units need;
• hardware prefetching, which works best on regular, predictable access streams;
• compiler loop transformations, which require simple, analyzable loop structure.

Hints to compilers:

• use restrict (C99) so the compiler knows arrays do not alias, as in the sketch below;
• keep loop bounds and index expressions simple;
• enable and read vectorization reports (e.g., -fopt-info-vec with GCC).
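A small sketch of such hints (restrict is C99; the pragma requires OpenMP support):

// 'restrict' promises the compiler that a, b, and c never alias,
// which removes a common obstacle to vectorization.
void vadd(double *restrict a, const double *restrict b,
          const double *restrict c, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}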

Even with aggressive compiler optimizations, manual cache-aware restructuring of algorithms and data layout is often required for best HPC performance.

Practical Guidelines and Workflow

  1. Identify memory-bound kernels:
    • Use a profiler to find hotspots where CPU is stalled on memory.
  2. Inspect data access patterns:
    • Check for strides > 1, non-contiguous data structures, and pointer chasing.
  3. Apply simple fixes first:
    • Fix loop orders to match memory layout.
    • Convert obvious AoS to SoA when it helps.
    • Align data and remove unnecessary indirections.
  4. Introduce tiling/blocking:
    • Block in space (and possibly time) so the working set fits into cache.
    • Tune tile sizes empirically on target hardware.
  5. Measure and iterate:
    • Use hardware counters and wall-clock time.
    • Compare versions and keep the simplest change that gives the desired gain.
  6. Balance readability and performance:
    • Isolate performance-critical kernels where more complex, cache-optimized code is justified.
    • Document assumptions (e.g., data layout, sizes) clearly.

By systematically applying these techniques, many HPC codes can achieve substantial speedups purely from better cache utilization, often without changing algorithms or parallelization strategies.
