Why Memory Optimization Matters in HPC
Memory behavior often determines performance on modern HPC systems more than raw flop rate. Many codes are memory-bound: they spend most of their time waiting on data rather than computing. Effective memory optimization can:
- Increase achieved bandwidth and reduce latency stalls
- Allow larger problems to fit in memory
- Reduce communication and I/O pressure
- Improve energy efficiency and scalability
This chapter focuses on practical techniques to make better use of the memory hierarchy in HPC applications, from caches to NUMA domains and node‑level memory capacity.
Identifying Memory Bottlenecks
Before optimizing, you need evidence that memory is the limiting factor.
Common Symptoms of Memory-Bound Code
- Achieved compute throughput (flop rate) stays low even though all cores appear busy.
- Performance hardly improves when using more cores on the same node.
- Using higher compiler optimization levels yields little speedup.
- Roofline analysis (covered elsewhere) places the kernel under the sloped bandwidth roof rather than the flat compute roof.
- Hardware counters show high cache miss rates or stalled cycles.
Useful Metrics and Indicators
Using profiling tools or hardware counters, look for:
- Cache miss rates: L1, L2, L3 miss percentages
- Memory bandwidth utilization: GB/s vs. theoretical peak
- Stall reasons: cycles stalled on memory, loads, stores
- Instructions per cycle (IPC): low IPC with many memory stalls
- NUMA metrics: remote vs. local memory reads/writes
These metrics help you decide whether to target:
- Spatial/temporal locality (cache behavior)
- Bandwidth usage (streaming vs. random access)
- NUMA placement (local vs. remote memory)
- Data volume (working set too large)
Working Set and Data Volume
A fundamental principle: move less data.
Reducing the Working Set
The working set is the data actively used in some period. If it fits in cache, performance can improve dramatically.
Strategies:
- Use more compact data structures:
- Prefer float over double where accuracy allows.
- Use narrower integer types (int32_t instead of int64_t) when safe.
- Pack related flags or small values (bitfields, small structs).
- Avoid unnecessary duplication:
- Remove temporary arrays that can be computed on the fly.
- Overwrite in place when legal (no correctness issues).
- Store derived quantities only when reused enough to pay off.
- Use structure-of-arrays (SoA) instead of array-of-structures (AoS) for hot data, or vice versa, depending on access pattern and vectorization (see below).
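As a rough sketch of the first strategy (more compact data structures), the hypothetical particle record below shrinks from 32 bytes to roughly 16 bytes per element once narrower types are acceptable; the field names and layout are purely illustrative:
#include <stdint.h>
#include <stdio.h>

// Hypothetical "wide" record: 8-byte fields everywhere (32 bytes per element).
typedef struct {
    double  mass;
    double  charge;
    int64_t cell_id;
    int64_t flags;
} ParticleWide;

// Compacted record: float and int32_t where range/precision allow,
// flags packed into one byte (typically 16 bytes after padding).
typedef struct {
    float   mass;
    float   charge;
    int32_t cell_id;
    uint8_t flags;
} ParticleCompact;

int main(void) {
    printf("wide: %zu bytes, compact: %zu bytes\n",
           sizeof(ParticleWide), sizeof(ParticleCompact));
    return 0;
}
Halving the element size halves both the working set and the bandwidth needed to stream it.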
Reducing Memory Traffic
Even if data fits in memory, moving it repeatedly is expensive.
- Fuse loops to reuse loaded data:
// Before: two passes over A and B
for (int i = 0; i < N; ++i)
C[i] = A[i] + B[i];
for (int i = 0; i < N; ++i)
D[i] = C[i] * 2.0;
// After: one pass
for (int i = 0; i < N; ++i) {
double c = A[i] + B[i];
C[i] = c;
D[i] = c * 2.0;
}
- Avoid repeated conversion or copying between formats in inner loops.
- Cache results of expensive but reusable operations, as long as caching doesn’t blow up the working set.
There is often a trade‑off: more computation vs. less memory traffic. In a memory-bound regime, doing a bit more computation to avoid loads/stores is usually a win.
Data Layout and Access Patterns
The way data is stored and traversed has a huge impact on cache behavior and bandwidth.
Contiguous Access and Stride
Processors and memory systems are optimized for sequential accesses.
Guidelines:
- Access arrays with unit stride whenever possible.
- In C/C++ (row-major), the last index varies fastest:
- Prefer loops like:
for (int i = 0; i < N; ++i)
for (int j = 0; j < M; ++j)
A[i][j] = ...; // Good: j is inner index, contiguous in memory
- Avoid:
for (int j = 0; j < M; ++j)
for (int i = 0; i < N; ++i)
A[i][j] = ...; // Poor: each access jumps a full row (M elements)
- In Fortran (column‑major), the first index should be the innermost loop.
For multi-dimensional data stored as flattened arrays, arrange indices to keep the innermost loop walking contiguous elements.
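A minimal sketch of this for a flattened, row-major N x M array (the function and variable names are illustrative):
#include <stdlib.h>

// Allocate and fill an N x M grid stored as one contiguous block.
double *make_grid(int N, int M) {
    double *A = malloc((size_t)N * (size_t)M * sizeof(double));
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < M; ++j)
            A[(size_t)i * M + j] = 0.0; // inner j loop walks consecutive elements
    return A;
}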
Structure of Arrays vs. Array of Structures
Consider a 3D point:
typedef struct {
double x;
double y;
double z;
} Point;
Point *points; // AoS: points[i].x, points[i].y, points[i].z
If your kernel only uses x and y, an array of structures (AoS) wastes bandwidth fetching z.
Alternative:
typedef struct {
double *x;
double *y;
double *z;
} PointsSoA;
PointsSoA pts; // SoA: pts.x[i], pts.y[i], pts.z[i]
Advantages of SoA in common HPC scenarios:
- Better cache utilization when only a subset of fields is used.
- Easier vectorization (contiguous elements of the same field).
- Less wasted memory bandwidth.
AoS can be better when you always use all fields together and want them in the same cache line. In performance-critical regions, it is often worth maintaining separate “optimized” layouts even if the code outside those regions uses more convenient structures.
Alignment and Padding
Aligned data accesses are faster and more vectorization-friendly.
- Align frequently accessed arrays on cache-line boundaries (e.g., 64 bytes on many CPUs).
- Many compilers and libraries support alignment directives or allocation routines:
double *A;
posix_memalign((void **)&A, 64, N * sizeof(double));
- Avoid false sharing (covered further below) by padding to cache-line size.
- Be careful with padding: too much unconditional padding can increase memory footprint.
Blocking and Tiling
Large problems rarely fit in the small, fast levels of the cache hierarchy. Blocking (also called tiling) restructures loops to work on sub-blocks that fit better in cache.
Example: matrix multiplication (conceptually):
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
for (int k = 0; k < N; ++k)
C[i][j] += A[i][k] * B[k][j];
Blocked version:
for (int ii = 0; ii < N; ii += BI)
for (int jj = 0; jj < N; jj += BJ)
for (int kk = 0; kk < N; kk += BK)
for (int i = ii; i < min(ii + BI, N); ++i)
for (int j = jj; j < min(jj + BJ, N); ++j)
for (int k = kk; k < min(kk + BK, N); ++k)
C[i][j] += A[i][k] * B[k][j];- Choose block sizes
BI, BJ, BKto fit in cache (through experimentation or modeling). - This can reduce cache misses and greatly increase effective reuse.
Blocking is effective for:
- Dense linear algebra (matrix multiplication and other matrix operations).
- Many structured grid problems.
- Multi-dimensional loops where each element is reused across iterations.
Modern compilers sometimes perform automatic blocking, but explicit restructuring often yields better, more predictable results.
NUMA-Aware Memory Usage
On multi-socket nodes, memory is typically Non‑Uniform Memory Access (NUMA): each socket has local memory, and accessing remote memory is slower and may have less bandwidth.
First-Touch Initialization
Most Linux systems with NUMA use first-touch allocation:
- Physical memory is mapped to the NUMA node of the core that first writes to the page.
To ensure data is local to the threads using it:
- Initialize arrays in parallel using the same partitioning as the main computation.
Example with OpenMP:
double *A = malloc(N * sizeof(double));
// Parallel first-touch
#pragma omp parallel for
for (int i = 0; i < N; ++i)
A[i] = 0.0;
// Later computation uses the same partitioning
#pragma omp parallel for
for (int i = 0; i < N; ++i)
A[i] = compute(i);
If a single thread initializes all memory, it will all be placed on one NUMA node, causing remote accesses for other threads.
NUMA Placement and Binding
Tools and APIs can control placement more explicitly:
- numactl on Linux (numactl --cpunodebind=0 --membind=0 ./app).
- Thread and process pinning (affinity) via runtime options or APIs.
- MPI rank placement strategies to align ranks with NUMA domains.
Goals:
- Threads mostly access memory on their local NUMA node.
- Avoid heavy cross‑socket traffic, which can cause both latency and bandwidth issues.
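For explicit placement from within the application, the Linux libnuma API is one option; a minimal sketch, with error handling and node selection simplified (link with -lnuma):
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA API not available on this system\n");
        return EXIT_FAILURE;
    }
    size_t bytes = (size_t)1000000 * sizeof(double);
    // Place the allocation on NUMA node 0 explicitly.
    double *A = numa_alloc_onnode(bytes, 0);
    if (A == NULL)
        return EXIT_FAILURE;
    // ... threads pinned to node 0 would work on A here ...
    numa_free(A, bytes);
    return EXIT_SUCCESS;
}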
Avoiding False Sharing and Contention
Even when data fits in cache and is “local,” multi-threaded access patterns can degrade performance.
False Sharing
False sharing occurs when multiple threads write to different variables that occupy the same cache line. Each write invalidates the line for other cores, generating coherence traffic.
Example:
typedef struct {
double x;
double y;
} Data;
Data shared[NUM_THREADS];
#pragma omp parallel
{
int tid = omp_get_thread_num();
for (int i = 0; i < N; ++i)
shared[tid].x += 1.0; // Each thread writes to a different element,
// but adjacent elements may share a cache line
}
Fix with padding:
typedef struct {
double x;
double pad[7]; // Assuming 8 doubles per 64-byte cache line
} PaddedData;
PaddedData shared[NUM_THREADS];
Alternatively, reorganize the code so that each thread writes to a private variable and only occasionally combines results.
Contended Synchronization
Memory hotspots can appear around locks, atomics, and shared counters:
- A single shared counter updated by many threads leads to cache-line ping-pong.
- Frequent global reductions with atomics reduce scalability.
Mitigations:
- Use thread-local buffers or privatization and combine once per iteration.
- Perform tree-based or hierarchical reductions rather than single shared variables.
- Minimize critical sections in hot loops; move synchronization out if possible.
Though these are synchronization issues, the underlying cost is often cache-line movement and memory coherence, so they fall under memory optimization.
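As a sketch of privatization, an OpenMP reduction keeps one private partial sum per thread and combines them once at the end, instead of funneling every update through a single shared, atomically updated counter:
// Sum with per-thread private partial sums via OpenMP's reduction clause.
double sum_array(const double *a, int n) {
    double total = 0.0;
    // Each thread accumulates into its own copy of 'total';
    // the copies are combined once at the end of the loop.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += a[i];
    return total;
}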
Caching Behavior and Locality
Optimizing for cache means maximizing temporal and spatial locality.
Temporal Locality
Temporal locality: data accessed now will be accessed again soon.
- Reuse loaded data before moving on:
- Reorder loops to perform all operations on an element while it is “hot” in cache.
- Combine related computations into a single pass when possible (loop fusion).
- Avoid patterns that reload the same data many times, such as repeatedly scanning large arrays when results could be stored or aggregated in fewer passes.
Spatial Locality
Spatial locality: data near each other in memory is accessed close in time.
- Traverse multi-dimensional arrays in cache-friendly orders, as discussed above.
- Group related data that is accessed together into contiguous regions.
- Use blocking to process small regions fully before moving to the next.
A simple example: 2D stencil update in C-like row-major storage:
for (int i = 1; i < N-1; ++i)
for (int j = 1; j < M-1; ++j)
out[i][j] = 0.25*(in[i-1][j] + in[i+1][j] +
in[i][j-1] + in[i][j+1]);
Here, keeping j as the inner loop uses spatial locality along rows; blocking in both dimensions can further exploit reuse.
Optimization Patterns for Common HPC Kernels
A few recurring patterns illustrate how memory considerations shape implementations.
Streaming Kernels (Vector Operations)
Operations like y = a*x + y (AXPY) are bandwidth‑bound.
Tips:
- Ensure x and y are aligned and contiguous.
- Avoid unnecessary temporaries or indirect addressing.
- Use vector-friendly data layouts and let the compiler or libraries (e.g., BLAS) handle low-level tuning.
- For multi-threaded streaming, partition arrays in large contiguous chunks to each thread.
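A minimal multi-threaded AXPY sketch along these lines (restrict and the static schedule are choices made here, not requirements):
// y = a*x + y: one load of x, one load and one store of y per element.
void axpy(int n, double a, const double *restrict x, double *restrict y) {
    // schedule(static) hands each thread one large contiguous chunk.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}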
Stencil and Grid-Based Codes
Regular grids with neighbor accesses (stencils) are very common.
Memory strategies:
- Choose data layout and loop ordering to make neighbor accesses close in memory.
- Use blocking/tiling in space (and sometimes time) to keep updated regions in cache.
- Consider storing halo/ghost cells contiguously to reduce strided accesses.
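A sketch of spatial blocking applied to the 2D stencil from the previous section, using a flattened row-major layout; the block sizes BI and BJ are tuning parameters found by experiment:
#define MIN(a, b) ((a) < (b) ? (a) : (b))

// in and out are N x M row-major grids; only interior points are updated.
void stencil_blocked(int N, int M, const double *in, double *out,
                     int BI, int BJ) {
    for (int ii = 1; ii < N - 1; ii += BI)
        for (int jj = 1; jj < M - 1; jj += BJ)
            // Finish a cache-sized block before moving on to the next one.
            for (int i = ii; i < MIN(ii + BI, N - 1); ++i)
                for (int j = jj; j < MIN(jj + BJ, M - 1); ++j)
                    out[i * M + j] = 0.25 * (in[(i - 1) * M + j] + in[(i + 1) * M + j] +
                                             in[i * M + j - 1]  + in[i * M + j + 1]);
}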
Sparse Linear Algebra
Sparse matrices can be dominated by memory traffic and irregular access.
Key ideas:
- Use compact storage formats (CSR, CSC, ELL, etc.) suited to your sparsity pattern.
- Reorder rows/columns to improve locality (e.g., bandwidth reduction algorithms).
- Keep index and value arrays tightly packed and avoid over-allocation.
- Minimize indirect references through multiple levels of pointers.
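For reference, a minimal CSR sparse matrix-vector multiply; the array names follow the common convention (row_ptr, col_idx, val) rather than any particular library:
// y = A*x with A stored in CSR: row_ptr has n_rows+1 entries,
// col_idx and val each have one entry per nonzero.
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        // val and col_idx stream contiguously; x is gathered indirectly.
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
The indirect access to x is where locality is lost, which is exactly what row/column reordering tries to improve.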
Libraries often provide optimized data structures; when writing your own, measure the impact of different layouts.
Tools and Techniques for Memory Analysis
Various tools can help you understand and optimize memory usage:
- Hardware performance counters via:
- perf on Linux
- Vendor tools (Intel VTune, AMD uProf, etc.)
- Cache and memory profilers:
- Identifying hot memory regions, miss hotspots, bandwidth usage.
- Roofline tools (often integrated into profilers) to see whether kernels are memory- or compute-bound.
- Lightweight instrumentation:
- Timing specific loops before and after layout or loop-order changes.
- Tracking allocated memory versus problem size to ensure scalability.
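For example, a lightweight timing harness can be as small as the sketch below; the wrapper name is illustrative, the kernel argument stands in for whatever loop is under test, and omp_get_wtime could be replaced by any monotonic clock:
#include <omp.h>
#include <stdio.h>

// Report the wall-clock time of one kernel invocation.
void time_kernel(void (*kernel)(void)) {
    double t0 = omp_get_wtime();
    kernel();                     // the loop or routine being measured
    double t1 = omp_get_wtime();
    printf("kernel time: %.3f s\n", t1 - t0);
}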
Use these tools to validate that an optimization actually reduces cache misses, improves bandwidth utilization, or decreases execution time, rather than relying on intuition alone.
Balancing Memory Optimization with Maintainability
Memory-optimized code can become complex and harder to read. In an HPC context:
- Isolate heavily tuned kernels in separate, well-documented modules.
- Add comments explaining why specific layouts, blocking factors, or padding exist.
- Use tests to ensure that refactoring for performance does not break correctness.
- Prefer using well-tuned numerical libraries when available instead of reinventing them.
Aim for a balance: apply aggressive memory optimization where it matters most (hotspots), and keep the rest of the code as simple as possible, guided by profiling results rather than premature micro-optimization.