12.4 Memory optimization

Why Memory Optimization Matters in HPC

Memory behavior often determines performance on modern HPC systems more than raw flop rate. Many codes are memory-bound: they spend most of their time waiting on data rather than computing. Effective memory optimization can cut the time spent stalled on loads and stores, improve scaling across cores and sockets, and reduce the memory footprint per node.

This chapter focuses on practical techniques to make better use of the memory hierarchy in HPC applications, from caches to NUMA domains and node‑level memory capacity.

Identifying Memory Bottlenecks

Before optimizing, you need evidence that memory is the limiting factor.

Common Symptoms of Memory-Bound Code

Typical signs include runtimes that barely improve when more cores on the same node are used, low instructions-per-cycle despite simple arithmetic, and performance that drops sharply once the problem size no longer fits in cache.

Useful Metrics and Indicators

Using profiling tools or hardware counters, look for high cache miss rates (especially in the last-level cache), measured bandwidth close to the platform's sustainable peak, a large fraction of cycles stalled on memory, and significant remote (cross-socket) traffic on NUMA systems.

These metrics help you decide whether to target cache locality, memory bandwidth, NUMA placement, or overall memory footprint.

Working Set and Data Volume

A fundamental principle: move less data.

Reducing the Working Set

The working set is the data a computation actively touches during a given phase. If it fits in cache, performance can improve dramatically.

Strategies include processing data in cache-sized chunks, fusing loops so values are reused while still in cache, dropping fields a kernel does not need, and using smaller data types where precision allows; the last idea is sketched below.
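
As a minimal sketch (assuming single precision is acceptable for the stored field), shrinking the element size halves the working set and the memory traffic, while accumulating in double precision limits the loss of accuracy:

#include <stddef.h>

// Sum a large field stored in single precision (illustrative example).
// Each load moves 4 bytes instead of 8; the accumulator stays in double.
double sum_field(const float *field, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += field[i];
    return sum;
}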

Reducing Memory Traffic

Even if data fits in memory, streaming it through the processor repeatedly is expensive. Fusing loops that traverse the same arrays removes whole passes:

  // Before: two passes over A and B
  for (int i = 0; i < N; ++i)
      C[i] = A[i] + B[i];
  for (int i = 0; i < N; ++i)
      D[i] = C[i] * 2.0;
  // After: one pass
  for (int i = 0; i < N; ++i) {
      double c = A[i] + B[i];
      C[i] = c;
      D[i] = c * 2.0;
  }

There is often a trade‑off: more computation vs. less memory traffic. In a memory-bound regime, doing a bit more computation to avoid loads/stores is usually a win.

Data Layout and Access Patterns

The way data is stored and traversed has a huge impact on cache behavior and bandwidth.

Contiguous Access and Stride

Processors and memory systems are optimized for sequential accesses.

Guidelines: keep the innermost loop unit-stride, match loop order to the array's storage order (row-major in C, column-major in Fortran), and avoid large or irregular strides where you can. In C:

    for (int i = 0; i < N; ++i)
      for (int j = 0; j < M; ++j)
        A[i][j] = ...;  // Good: j is inner index, contiguous in memory
    for (int j = 0; j < M; ++j)
      for (int i = 0; i < N; ++i)
        A[i][j] = ...;  // Poor: jumps by N each time

For multi-dimensional data stored as flattened arrays, arrange indices to keep the innermost loop walking contiguous elements.
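
For instance (a minimal sketch, assuming row-major layout with N rows and M columns), a flattened 2D array can be indexed as i*M + j so the inner j loop stays unit-stride:

double *A = malloc((size_t)N * M * sizeof(double));

for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
        A[(size_t)i * M + j] = 0.0;  // inner loop walks contiguous elements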

Structure of Arrays vs. Array of Structures

Consider a 3D point:

typedef struct {
    double x;
    double y;
    double z;
} Point;
Point *points; // AoS: points[i].x, points[i].y, points[i].z

If your kernel only uses x and y, an array of structures (AoS) wastes bandwidth fetching z.

Alternative:

typedef struct {
    double *x;
    double *y;
    double *z;
} PointsSoA;
PointsSoA pts; // SoA: pts.x[i], pts.y[i], pts.z[i]

Advantages of SoA in common HPC scenarios: kernels load only the fields they actually use, accesses are unit-stride and vectorize well, and no cache-line space is wasted on unused members.

AoS can be better when you always use all fields together and want them in the same cache line. In performance-critical regions, it is often worth maintaining separate “optimized” layouts even if the code outside those regions uses more convenient structures.
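
As a sketch using the PointsSoA layout above (pts.x, pts.y, pts.z each pointing to n contiguous doubles), a kernel that needs only x and y streams through two arrays and never loads z:

// Shift all points in the xy-plane; the z array is never touched
void shift_xy(PointsSoA pts, size_t n, double dx, double dy)
{
    for (size_t i = 0; i < n; ++i) {
        pts.x[i] += dx;  // unit-stride stream
        pts.y[i] += dy;  // unit-stride stream
    }
}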

Alignment and Padding

Aligned data accesses are faster and more vectorization-friendly.

  double *A;
  // Allocate N doubles aligned to a 64-byte cache line (posix_memalign is declared in <stdlib.h>)
  posix_memalign((void **)&A, 64, N * sizeof(double));

Blocking and Tiling

Large problems rarely fit in the small, fast levels of the cache. Blocking (also called tiling) restructures loops to work on sub-blocks that fit in cache and are reused before being evicted.

Example: matrix multiplication, first in its naive triple-loop form:

for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i][j] += A[i][k] * B[k][j];

Blocked version:

// min(a, b) is assumed to be a helper returning the smaller of its arguments
for (int ii = 0; ii < N; ii += BI)
  for (int jj = 0; jj < N; jj += BJ)
    for (int kk = 0; kk < N; kk += BK)
      for (int i = ii; i < min(ii + BI, N); ++i)
        for (int j = jj; j < min(jj + BJ, N); ++j)
          for (int k = kk; k < min(kk + BK, N); ++k)
            C[i][j] += A[i][k] * B[k][j];

Blocking is effective for dense linear algebra, stencil sweeps, and other kernels with substantial data reuse, provided the block sizes are chosen so that a block's working set fits in the targeted cache level.

Modern compilers sometimes perform automatic blocking, but explicit restructuring often yields better, more predictable results.

NUMA-Aware Memory Usage

On multi-socket nodes, memory is typically Non‑Uniform Memory Access (NUMA): each socket has local memory, and accessing remote memory is slower and may have less bandwidth.

First-Touch Initialization

Most Linux systems with NUMA use first-touch allocation: a physical page is placed on the NUMA node of the thread that first writes to it, not when malloc is called.

To ensure data is local to the threads using it, let each thread initialize (first touch) the same portion of an array that it will later compute on, using the same loop partitioning in both phases.

Example with OpenMP:

double *A = malloc(N * sizeof(double));

// Parallel first-touch: each thread writes its own chunk, so those pages
// end up on that thread's local NUMA node
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; ++i)
    A[i] = 0.0;

// Later computation uses the same static partitioning, so accesses stay local
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; ++i)
    A[i] = compute(i);

If a single thread initializes all memory, it will all be placed on one NUMA node, causing remote accesses for other threads.

NUMA Placement and Binding

Tools and APIs can control placement more explicitly: numactl can bind a process's memory and CPUs to chosen nodes, libnuma gives per-allocation control from within the code, and thread pinning (for example via OMP_PROC_BIND and OMP_PLACES) keeps threads from migrating away from their data.

The goals are to keep each thread's data on its local node, spread bandwidth demand across all memory controllers, and avoid remote accesses caused by thread migration.
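
As a hedged sketch (assuming libnuma is installed and the program is linked with -lnuma), an allocation can also be placed on a chosen NUMA node directly from the code:

#include <stddef.h>
#include <numa.h>   // libnuma; link with -lnuma

// Allocate n doubles with physical pages placed on the given NUMA node
double *alloc_on_node(size_t n, int node)
{
    if (numa_available() < 0)
        return NULL;                 // NUMA API not usable on this system
    return numa_alloc_onnode(n * sizeof(double), node);
    // release later with numa_free(ptr, n * sizeof(double))
}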

Avoiding False Sharing and Contention

Even when data fits in cache and is “local,” multi-threaded access patterns can degrade performance.

False Sharing

False sharing occurs when multiple threads write to different variables that occupy the same cache line. Each write invalidates the line for other cores, generating coherence traffic.

Example:

typedef struct {
    double x;
    double y;
} Data;
Data shared[NUM_THREADS];
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    for (int i = 0; i < N; ++i)
        shared[tid].x += 1.0; // Each thread writes to a different element,
                              // but adjacent elements may share a cache line
}

Fix with padding:

typedef struct {
    double x;
    double pad[7]; // Assuming 8 doubles per 64-byte cache line
} PaddedData;
PaddedData shared[NUM_THREADS];

Alternatively, reorganize the code so that each thread accumulates into a private variable and results are combined only occasionally, as sketched below.
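
A minimal sketch of this pattern with OpenMP, where work(i) is a placeholder for the real per-element computation: each thread accumulates privately, and the shared total is updated only once per thread.

double total = 0.0;

#pragma omp parallel
{
    double local = 0.0;          // private to each thread: no cache-line sharing
    #pragma omp for
    for (int i = 0; i < N; ++i)
        local += work(i);        // work() stands in for the actual computation

    #pragma omp atomic           // one contended update per thread, not per iteration
    total += local;
}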

Contended Synchronization

Memory hotspots can appear around locks, atomics, and shared counters: every thread that updates the same lock word or counter drags the same cache line from core to core, and the cost grows with the thread count.

Mitigations include privatizing counters and combining them at the end (as in the previous example), using reductions instead of shared atomics, synchronizing less frequently, and splitting one global lock into finer-grained locks.

Though these are synchronization issues, the underlying cost is often cache-line movement and memory coherence, so they fall under memory optimization.

Caching Behavior and Locality

Optimizing for cache means maximizing temporal and spatial locality.

Temporal Locality

Temporal locality means that data accessed now is likely to be accessed again soon; restructure computations (for example by fusing loops or blocking) so that reuse happens while the data is still in cache.

Spatial Locality

Spatial locality means that data stored near each other in memory tends to be accessed close together in time; unit-stride access makes use of every byte of each cache line that is fetched.

A simple example is a 2D stencil update on row-major (C-style) storage:

for (int i = 1; i < N-1; ++i)
  for (int j = 1; j < M-1; ++j)
    out[i][j] = 0.25*(in[i-1][j] + in[i+1][j] +
                      in[i][j-1] + in[i][j+1]);

Here, keeping j as the inner loop uses spatial locality along rows; blocking in both dimensions can further exploit reuse.
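A sketch of the blocked variant (BI and BJ are tile sizes chosen so a tile of in and out fits in cache; min(a, b) is the same helper assumed in the blocked matrix multiplication above):

for (int ii = 1; ii < N-1; ii += BI)
  for (int jj = 1; jj < M-1; jj += BJ)
    for (int i = ii; i < min(ii + BI, N-1); ++i)
      for (int j = jj; j < min(jj + BJ, M-1); ++j)
        out[i][j] = 0.25*(in[i-1][j] + in[i+1][j] +
                          in[i][j-1] + in[i][j+1]);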

Optimization Patterns for Common HPC Kernels

A few recurring patterns illustrate how memory considerations shape implementations.

Streaming Kernels (Vector Operations)

Operations like y = a*x + y (AXPY) are bandwidth‑bound.

Tips: fuse consecutive vector operations into a single pass where possible, avoid temporaries that add extra loads and stores, keep accesses unit-stride, and judge such kernels by achieved bandwidth rather than flop rate; a small sketch follows.
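
A minimal AXPY sketch: each element involves two flops but roughly 24 bytes of traffic (load x, load y, store y; more if the store triggers a write-allocate), so achieved bandwidth is the number that matters.

// y = a*x + y: two flops per element, ~24 bytes of memory traffic per element
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}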

Stencil and Grid-Based Codes

Regular grids with neighbor accesses (stencils) are very common.

Memory strategies: keep the inner loop unit-stride along the storage order, tile the grid so neighboring rows stay in cache between sweeps (as in the blocked stencil above), reuse loaded planes in 3D sweeps, and minimize the number of full-grid passes per time step.

Sparse Linear Algebra

Sparse matrices can be dominated by memory traffic and irregular access.

Key ideas: use compact storage formats such as CSR (compressed sparse row) or blocked variants, reorder rows and columns to improve locality of the accessed vector entries, and keep index types as small as correctness allows to cut traffic; a CSR kernel is sketched after the next paragraph.

Libraries often provide optimized data structures; when writing your own, measure the impact of different layouts.
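
As a sketch of the standard CSR layout (row_ptr, col_idx, and val arrays, as commonly defined), a sparse matrix-vector product streams through the matrix once, while accesses to x are indirect and harder for the cache to predict:

// y = A*x for an n-row matrix stored in CSR format
void csr_spmv(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];   // indirect, irregular access to x
        y[i] = sum;
    }
}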

Tools and Techniques for Memory Analysis

Various tools can help you understand and optimize memory usage: hardware-counter profilers such as perf and LIKWID report cache misses and achieved bandwidth, Valgrind's Cachegrind simulates cache behavior line by line, vendor profilers such as Intel VTune attribute stalls to levels of the memory hierarchy, and the STREAM benchmark establishes the bandwidth the hardware can actually deliver.

Use these tools to validate that an optimization actually reduces cache misses, improves bandwidth utilization, or decreases execution time, rather than relying on intuition alone.

Balancing Memory Optimization with Maintainability

Memory-optimized code can become complex and harder to read. In an HPC context it pays to document why a layout or loop structure was chosen, keep a simple reference implementation for correctness testing, and confine heavily tuned kernels behind clear interfaces.

Aim for a balance: apply aggressive memory optimization where it matters most (hotspots), and keep the rest of the code as simple as possible, guided by profiling results rather than premature micro-optimization.
