
Cache optimization

Locality and Why Caches Matter for Performance

Caches matter for performance because they hide the large latency gap between the CPU and main memory. The parent chapter has already introduced the general idea of performance analysis and the memory hierarchy. In this chapter we concentrate on how to write and organize code so that it uses caches effectively.

Two related concepts drive most cache optimizations: spatial locality and temporal locality.

Spatial locality describes how often a program accesses data that are stored close together in memory. Since cache lines fetch a contiguous block of bytes at a time, accessing consecutive elements of an array is much cheaper than jumping around randomly.

Temporal locality describes how often a program reuses the same data within a short time. If a value is used repeatedly before it is evicted from cache, those later uses are much faster.
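To make the two concepts concrete, here is a minimal C sketch (array sizes and function names are illustrative): the first loop exploits spatial locality by walking an array with unit stride, while the second exploits temporal locality by reusing a small coefficient table on every iteration.

```c
#include <stddef.h>

/* Spatial locality: consecutive elements of a share cache lines,
   so most accesses hit in cache. */
double sum_array(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* Temporal locality: the small table c[4] is reused on every
   iteration and stays in cache (or registers) throughout the loop. */
void evaluate_polynomial(double *x, size_t n, const double c[4]) {
    for (size_t i = 0; i < n; i++) {
        x[i] = c[0] + x[i] * (c[1] + x[i] * (c[2] + x[i] * c[3]));
    }
}
```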

Most practical cache optimizations are ways of increasing spatial and temporal locality. In HPC, where large arrays dominate, improving locality can be the difference between an application that is memory bound and one that approaches peak CPU performance.

Good cache optimization maximizes spatial locality and temporal locality. Poor locality leads to a memory bound code, even if the algorithm itself is efficient.

Access Patterns and Stride

Cache lines are fetched as fixed size units, for example 64 bytes on many CPUs. When you access a[i], the processor loads the entire cache line that contains a[i]. If you soon access a[i+1], it is already in the cache. If you instead jump to a[i+1000], you likely trigger a new cache line load and possibly evict useful data.

This behavior can be described in terms of stride. If you access elements of an array using indices that differ by a constant $k$, the stride is $k$ elements. A stride of 1 means you visit consecutive elements. Large strides reduce spatial locality and increase the number of cache lines that must be fetched per unit of useful work.
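The following C sketch illustrates the difference (the stride parameter k is a placeholder): both functions traverse the same array, but once the stride exceeds the eight doubles that fit in a 64 byte line, nearly every access in the strided version touches a new cache line.

```c
#include <stddef.h>

/* Stride-1 traversal: each 64-byte cache line holds 8 doubles,
   so roughly one miss per 8 accesses. */
double sum_stride1(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Stride-k traversal: with k >= 8 elements, each access typically
   lands on a different cache line, so misses dominate. */
double sum_strided(const double *a, size_t n, size_t k) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += k)
        s += a[i];
    return s;
}
```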

In many scientific codes, poor performance comes from loops with large effective stride. For example, in C, where arrays are row major, iterating over a 2D array with the row index varying in the inner loop produces a stride equal to the row length, which is often very large. Reversing the loop order so that the inner loop varies the last index matches the memory layout, reduces the stride to one, and often gives a dramatic speedup.

The same data structure can therefore behave as either cache friendly or cache hostile, purely depending on access order. Cache optimization often begins with careful examination of loop nests to ensure the innermost loop traverses memory in the order it is laid out.

Loop Transformations for Cache Behavior

Loop transformations are mechanical changes to loops that preserve the mathematical result but alter the memory access pattern. Several standard transformations are especially important for cache optimization.

Loop interchange swaps the order of nested loops. This is particularly useful for multidimensional arrays. The general idea is to make the inner loop index the dimension that is contiguous in memory. In C this is usually the last index, while in Fortran it is usually the first. Interchanging loops does not change the arithmetic but can completely change the cache hit rate.
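As a sketch in C (with a placeholder dimension N), the first loop nest below varies the row index in the inner loop and strides through memory by an entire row, while the interchanged version varies the last index and walks memory contiguously.

```c
#define N 1024

/* Cache-hostile order: the inner loop varies the row index, so
   successive accesses to a[i][j] are N doubles apart in memory. */
void add_column_order(double a[N][N], const double b[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] += b[i][j];
}

/* Cache-friendly after interchange: the inner loop varies the last
   index, giving unit-stride access in row-major C. */
void add_row_order(double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += b[i][j];
}
```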

Loop fusion combines separate loops that touch the same data into a single loop, so that each element is loaded into cache once and then used for multiple operations before being evicted. This improves temporal locality. However, fusion can also increase register and cache pressure if the combined loop uses many arrays at once, so it needs to be used judiciously.
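A minimal sketch of fusion, assuming the two loops are independent and read the same input: the fused version loads each x[i] once instead of twice.

```c
#include <stddef.h>

/* Before fusion: x is streamed through the cache twice. */
void compute_separate(double *y, double *z, const double *x, size_t n) {
    for (size_t i = 0; i < n; i++) y[i] = 2.0 * x[i];
    for (size_t i = 0; i < n; i++) z[i] = x[i] * x[i];
}

/* After fusion: each x[i] is loaded once and used for both results. */
void compute_fused(double *y, double *z, const double *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double xi = x[i];
        y[i] = 2.0 * xi;
        z[i] = xi * xi;
    }
}
```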

Loop fission, also called loop distribution, splits one loop into several. This can reduce the working set size of each loop, which can be helpful when the original loop touches many arrays or a large region of memory simultaneously. A smaller working set is more likely to fit in cache.
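Conversely, a sketch of fission on a hypothetical loop that updates several unrelated arrays: splitting it gives each resulting loop a smaller working set and a simpler access stream.

```c
#include <stddef.h>

/* Original loop touches four large arrays at once. */
void update_combined(double *a, double *b, double *c, const double *d, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + 1.0;
        b[i] = b[i] * 2.0;
        c[i] = c[i] - d[i];
    }
}

/* After fission: each loop has a smaller working set and a simpler
   access stream, which can also help hardware prefetching. */
void update_split(double *a, double *b, double *c, const double *d, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = a[i] + 1.0;
    for (size_t i = 0; i < n; i++) b[i] = b[i] * 2.0;
    for (size_t i = 0; i < n; i++) c[i] = c[i] - d[i];
}
```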

Loop unrolling replicates the loop body several times and reduces the number of iterations. This can increase instruction level parallelism and enable better use of SIMD, which indirectly affects cache behavior by letting the CPU do more work on the data already present in registers and cache. However, excessive unrolling can increase code size and potentially hurt instruction cache behavior.
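A sketch of manual four way unrolling for a reduction (in practice the compiler often does this for you): the independent partial sums expose more instruction level parallelism on data already in cache, and a remainder loop handles lengths that are not a multiple of four.

```c
#include <stddef.h>

/* Four-way unrolled sum with independent partial accumulators. */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder iterations */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```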

Most compilers can perform some of these transformations automatically when optimization flags are enabled, but they are conservative and may miss opportunities. Explicitly writing loops to follow good access patterns usually yields better and more predictable cache usage.

When optimizing loops for cache behavior, always adjust loop order so that the innermost loop walks contiguous memory, and consider fusion or fission to control the per loop working set size.

Working Set Size and Cache Capacity

Caches are not only about access pattern, but also about how much data you touch within a time window. The total data that must be kept active to perform a part of the computation is often called the working set. If your working set fits in cache, you can achieve very high bandwidth and low latency. If it does not, the CPU continually evicts and reloads lines, causing cache thrashing.

For example, consider a triple nested loop implementing a matrix operation. If you iterate over the full rows and columns in the natural way, the working set of a single outer loop iteration might be larger than the L1 or even L2 cache. Each inner loop access may then miss in cache and go to main memory, even if there is good spatial locality along a single row.

Cache optimization in this context often aims to reduce the working set that is actively used at any one time to something that fits into the appropriate cache level. This is not just about total array size, but about how much of the array the code touches in a single region of the loop nest.

In practice, it is useful to think about different cache levels. A strategy that fits data into L1 cache is more demanding but yields higher speed, while fitting into L3 cache might be easier but provides a smaller benefit. Analytical modeling or simple experiments can help you select tile sizes that correspond to cache capacities; tiling is discussed in the next section.

Blocking and Tiling

Blocking, also called tiling, is one of the core techniques for cache optimization in scientific computing. The basic idea is to break a large computation over a big array into smaller chunks, or blocks, that fit into cache. You then perform as much computation as possible within each block before moving on to the next one.

In matrix operations, for instance, naive triple nested loops often suffer from poor cache reuse because they repeatedly traverse large rows or columns that do not fit in cache. By iterating in tiles, where you operate on submatrices of size $B \times B$, you confine the working set of each loop nest to a subset that can be kept in cache. This allows multiple passes over the same blocks of data while they remain cached.

Blocking improves both temporal locality and spatial locality. Within a block, you still traverse data linearly to get spatial locality. Across blocks, you increase temporal locality by revisiting the same block elements multiple times close together in time.
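As a concrete sketch, the blocked matrix multiplication below uses illustrative sizes N and B: each combination of block indices works on $B \times B$ tiles of all three matrices, which together are small enough to stay cached while their elements are reused.

```c
#define N 1024
#define B 32   /* block (tile) size; see the sizing estimate below */

static inline int min(int a, int b) { return a < b ? a : b; }

/* Blocked (tiled) matrix multiplication: C += A * B_.
   Each (ii, kk, jj) block iteration works on B x B tiles, so
   elements are reused while they are still resident in cache. */
void matmul_blocked(double C[N][N], const double A[N][N],
                    const double B_[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < min(ii + B, N); i++)
                    for (int k = kk; k < min(kk + B, N); k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < min(jj + B, N); j++)
                            C[i][j] += aik * B_[k][j];
                    }
}
```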

The ideal block size is a compromise between fitting the block in cache and amortizing loop overhead. It depends on element size, number of arrays involved, and cache capacity. For example, if each element is 8 bytes, and you operate on three arrays simultaneously, a block of size $B \times B$ uses roughly $3 \times B^2 \times 8$ bytes. To fit this into a cache of capacity $C$, you require
$$
3 B^2 \times 8 \le C.
$$
This gives an approximate upper bound on $B$. In practice, you choose a somewhat smaller $B$ to leave room for other data and overhead.
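For example, assuming a 32 KiB L1 data cache and three double precision blocks as above,
$$
3 B^2 \times 8 \le 32768
\quad\Longrightarrow\quad
B^2 \le \frac{32768}{24} \approx 1365
\quad\Longrightarrow\quad
B \le 36,
$$
so a tile size such as $B = 32$ fits comfortably and leaves headroom for other data.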

Blocking is not limited to matrices. It can be applied to stencils on grids, sparse data structures that have locality, and even irregular problems if you can cluster related elements.

Cache Associativity and Conflict Misses

Caches are typically set associative, which means that each memory address maps to a limited set of cache locations. If several active data structures map to the same set, even if the total data size is smaller than the cache capacity, they can still evict each other. This results in conflict misses.

For HPC codes that use regular large arrays, certain strides or array sizes can interact poorly with cache indexing and cause severe conflict behavior. For example, if multiple large arrays are aligned in memory so that corresponding elements map to the same cache lines, accessing them together can lead to systematic conflicts. This is most visible in older or smaller caches, but even modern CPUs can suffer when access patterns align badly.

There are several strategies to mitigate conflict misses. One is to adjust the array layout via padding, so that rows or leading dimensions are not powers of two and therefore do not align in a way that causes conflicts. Another is to change the traversal order or block sizes so that the set of addresses accessed together is dispersed more evenly across the cache sets.
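A sketch of padding in C, assuming rows of 1024 doubles would otherwise place corresponding elements of successive rows at power of two spacings that map to the same cache sets:

```c
#define NROWS 1024
#define NCOLS 1024
#define PAD   8          /* extra doubles per row; breaks power-of-two spacing */

/* Without padding, a[i][j] and a[i+1][j] are 1024 * 8 = 8192 bytes
   apart, so a column traversal keeps hitting the same few cache sets.
   With padding, successive rows are (1024 + 8) * 8 bytes apart and
   spread across the sets. Only the first NCOLS entries of each row
   are actually used. */
static double a[NROWS][NCOLS + PAD];
```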

Although the detailed mapping from address to cache set is hardware specific, a practical rule is to avoid regularly spaced large strides that are simple powers of two, especially when multiple arrays are accessed in lockstep.

Data Layout and Structure Transformations

Data layout plays a major role in cache behavior. Two designs that capture this are array of structures, often abbreviated AoS, and structure of arrays, often abbreviated SoA. In AoS, you store a sequence of records, each with multiple fields. In SoA, you store each field in its own array, so that all values of the same field are contiguous.

For computations that process one field at a time over many elements, SoA is often far more cache friendly. It provides unit stride access to just the data that are needed. In contrast, AoS layouts can waste cache bandwidth by fetching fields that are not used in the current operation, which is particularly harmful if records are large.

On the other hand, if each access requires most fields from the same record, AoS can be adequate or even preferable, because you benefit from spatial locality over the record. This means that the optimal layout for cache behavior depends on the set of kernels you execute. In HPC codes, it is common to rearrange data into SoA form or adopt hybrid layouts to match the dominant access patterns.
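A sketch of the two layouts in C for a hypothetical particle type: the position update kernel uses every byte it loads in the SoA version, while the AoS version drags the unused mass and charge fields through the cache as well.

```c
#include <stddef.h>

/* Array of structures: all fields of one particle are adjacent. */
struct particle {
    double x, y, z;       /* position */
    double vx, vy, vz;    /* velocity */
    double mass, charge;  /* unused by the kernels below */
};

/* Structure of arrays: each field is contiguous across particles. */
struct particles_soa {
    double *x, *y, *z;
    double *vx, *vy, *vz;
    double *mass, *charge;
};

/* AoS kernel: each 64-byte record is loaded even though only six of
   its eight fields are used. */
void push_aos(struct particle *p, size_t n, double dt) {
    for (size_t i = 0; i < n; i++) {
        p[i].x += dt * p[i].vx;
        p[i].y += dt * p[i].vy;
        p[i].z += dt * p[i].vz;
    }
}

/* SoA kernel: only the six needed arrays are streamed, all with
   unit stride. */
void push_soa(struct particles_soa *p, size_t n, double dt) {
    for (size_t i = 0; i < n; i++) {
        p->x[i] += dt * p->vx[i];
        p->y[i] += dt * p->vy[i];
        p->z[i] += dt * p->vz[i];
    }
}
```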

Another transformation is reordering indices or dimensions in arrays. If most kernels access a 3D array in a particular order, you might choose a layout that gives stride one to that order. This can sometimes be at odds with other kernels, which illustrates that cache optimization is an application wide problem, not merely a single loop fix.

When changing data layout, you must also consider alignment. Many architectures benefit from aligning arrays to cache line or SIMD width boundaries. This reduces the number of cache lines needed for aligned loads and improves SIMD efficiency, which complements cache optimizations.
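A sketch of cache line aligned allocation using the C11 aligned_alloc function (posix_memalign is a POSIX alternative); the 64 byte line size is an assumption about the target CPU.

```c
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Allocate n doubles starting on a cache line boundary.
   C11 aligned_alloc expects the size to be a multiple of the
   alignment, so round it up. */
double *alloc_aligned_doubles(size_t n) {
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, rounded);
}
```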

Prefetching and Software Hints

Modern processors perform hardware prefetching, where the cache controller tries to predict future accesses and load cache lines early. This works best when access patterns are regular, such as stride one loops. If your loops follow simple patterns, you get the benefits of prefetching for free.

When patterns are only moderately regular, you may need to simplify your loop structure or remove data dependent branches in the inner loops, so the hardware prefetcher can see the pattern. This is an indirect form of cache optimization. It does not change which data are accessed, but changes how predictable the access stream is.

Some compilers and architectures support explicit software prefetch instructions, where you write code that requests that a specific address be loaded into cache in advance. These are mostly useful in very performance critical inner loops, and they must be timed carefully: too early and the cache line may be evicted before use, too late and you do not hide the latency. For most beginners in HPC, concentrating on regular access patterns and blocking yields more benefit than explicit prefetching.
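As an illustration of explicit prefetching, GCC and Clang provide the __builtin_prefetch intrinsic; in the sketch below the prefetch distance is a placeholder that would need tuning on a given machine, and the indirect indexing is what defeats the hardware prefetcher.

```c
#include <stddef.h>

#define PF_DIST 16   /* prefetch distance in elements; machine dependent */

/* Indirect gather: the index array makes accesses to x unpredictable
   for the hardware prefetcher, so request future lines explicitly.
   Arguments: address, 0 = read, 1 = low temporal locality. */
double gather_sum(const double *x, const int *idx, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&x[idx[i + PF_DIST]], 0, 1);
        s += x[idx[i]];
    }
    return s;
}
```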

In certain memory bound kernels with complex but analyzable patterns, advanced users may combine tiling with explicit prefetch to approach the best possible memory throughput. This represents a late stage optimization once high level locality has already been improved.

Measuring and Diagnosing Cache Behavior

Cache optimization should be guided by measurements. The parent chapter has introduced profiling tools and performance measurement. Here the focus is on metrics that indicate cache behavior.

Hardware performance counters expose statistics such as cache hits and misses at different levels, bandwidth utilization, and average memory latency. Many profiling tools and command line utilities can read these counters and report derived metrics, such as miss rates or cycles per instruction. By correlating these with source code, you can identify loops with high miss rates and low arithmetic intensity.

To diagnose cache issues, look for inner loops that spend most of their time stalled on memory accesses. If these loops also have irregular access patterns or access large arrays, they are good candidates for reordering, blocking, or data layout changes.

Microbenchmarks with simplified versions of your kernel can help you compare access patterns directly. For instance, you can test traversal of a large array with various strides and see how performance changes. This builds intuition about the cost of non unit stride and about the impact of working set size.
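A minimal microbenchmark sketch along these lines, using POSIX clock_gettime for timing; the array size is chosen to exceed typical last level caches, and the reported numbers will of course vary by machine.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (32 * 1024 * 1024)   /* 256 MB of doubles, larger than most L3 caches */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    double *a = malloc((size_t)N * sizeof(double));
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = 1.0;

    for (size_t stride = 1; stride <= 64; stride *= 2) {
        double t0 = seconds();
        double s = 0.0;
        for (size_t i = 0; i < N; i += stride)
            s += a[i];
        double t1 = seconds();
        volatile double sink = s;   /* keep the sum from being optimized away */
        (void)sink;
        /* Report time per element actually touched. */
        printf("stride %2zu: %.2f ns/element\n",
               stride, (t1 - t0) * 1e9 / (N / stride));
    }
    free(a);
    return 0;
}
```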

Effective cache optimization always combines measurement of cache related metrics, identification of hot loops, and targeted code transformations such as loop reordering, blocking, or data layout changes.

Interactions with Parallelism

Cache behavior does not occur in isolation. In HPC, codes run with threading, vectorization, and often distributed memory. Optimization choices at the cache level must therefore be consistent with parallelization strategies.

On shared memory systems, threads may share caches or have private caches. Data partitioning strategies should aim to give each thread a subset of the data that fits naturally in its local caches. Poorly arranged multithreaded programs can cause cache line ping pong, where multiple threads repeatedly update data in the same cache line, forcing it to migrate between cores. A particularly common form of this is false sharing, where threads update different variables that happen to live in the same line.

Good cache optimization in a parallel context prefers per thread data blocks that are aligned to cache line boundaries, avoids sharing frequently written data across threads in the same line, and preserves locality within each thread’s working set.
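A sketch of this in C with OpenMP (assuming OpenMP is available, for example compiling with -fopenmp under GCC): each thread accumulates into its own slot, and aligning each slot to a full cache line prevents false sharing between neighboring slots.

```c
#include <omp.h>
#include <stddef.h>

#define CACHE_LINE  64
#define MAX_THREADS 64   /* assumed upper bound on the thread count */

/* One accumulator per thread, aligned and padded to a full cache
   line so that writes by different threads never share a line. */
struct padded_sum {
    _Alignas(CACHE_LINE) double value;
};

static struct padded_sum partial[MAX_THREADS];

double parallel_sum(const double *a, size_t n) {
    int nthreads = 1;

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();

        partial[t].value = 0.0;
        #pragma omp for
        for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++)
            partial[t].value += a[i];   /* padded slot: no false sharing */
    }

    double total = 0.0;
    for (int t = 0; t < nthreads; t++)
        total += partial[t].value;
    return total;
}
```

In practice you would usually accumulate into a register local variable and write the padded slot once at the end, or simply use an OpenMP reduction clause; the padded layout matters most when per thread data really must live in a shared array.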

At the vectorization level, contiguous and aligned data not only give better SIMD performance but also help caches by ensuring that each load from memory brings in useful elements. Thus, cache optimization and vectorization strategies are closely linked. Choosing a structure of arrays layout, for example, often benefits both.

Finally, the choice of blocking for cache locality can interact with blocking for parallel decomposition. You may select a tiling scheme that suits both cache capacity and thread assignment, so that each thread operates on independent tiles that remain largely in its private caches. This kind of joint design is central to high performance scientific computing.

Practical Guidelines for Cache Conscious Coding

Cache optimization can become very detailed, but several practical guidelines cover most benefits for typical HPC applications.

First, design loop nests so that the innermost loop accesses contiguous data in memory. Second, keep the working set of inner loops small enough to fit in the relevant cache level, often by blocking multidimensional loops. Third, consider data layout choices, preferring structures that make the dominant access patterns contiguous, such as structure of arrays where appropriate.

Fourth, be cautious of regular large strides that can cause conflict misses, and introduce padding when needed to avoid pathological alignments. Fifth, align data structures to natural boundaries and favor predictable access patterns so that hardware prefetchers can operate effectively.

Above all, treat cache optimization as a measurement driven process. Use profiling tools to identify memory bound kernels and to confirm that code changes result in lower miss rates and higher effective bandwidth. This closes the loop between theory and practice and allows you to apply cache optimization techniques confidently to real HPC applications.
