
12.6 Vectorization strategies

Understanding Vectorization in Practice

In this chapter, we focus on how to use vectorization effectively in real HPC codes, assuming you already know the basic idea of SIMD and vector instructions from earlier chapters.

Vectorization strategies are about structuring your loops and data so that the compiler (or, when necessary, you) can map the work onto SIMD units: keeping iterations independent, keeping memory access contiguous, and keeping control flow simple.

We’ll concentrate on practical patterns and transformations you can apply to your own loops and kernels.

Recognizing Vectorization-Friendly Patterns

Compilers like simple, predictable loops. The easiest targets are loops with a countable trip count, unit-stride array accesses, independent iterations, and straight-line arithmetic in the body.

A typical good candidate:

c
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

This type of loop has no loop-carried dependencies, accesses x and y contiguously, and does only a small amount of simple arithmetic per element. Loops like this (AXPY-style updates, element-wise operations, simple reductions) are the workhorses of vectorization in HPC.

Data Layout and Memory Access Patterns

Even if the compiler wants to vectorize, the way your data is laid out in memory can help or hurt.

Structure of Arrays (SoA) vs Array of Structures (AoS)

Example AoS (less friendly to vectorization in many cases):

c
typedef struct {
    double x, y, z;
} Point;
Point p[N];
for (int i = 0; i < N; i++) {
    p[i].x = alpha * p[i].x;
}

Here, the x components are interleaved with y and z, so the memory layout is:

x0 y0 z0 x1 y1 z1 x2 y2 z2 ...

Vector loads want to pull x0 x1 x2 ... as a contiguous chunk.

SoA layout (more vector-friendly):

c
double px[N], py[N], pz[N];
for (int i = 0; i < N; i++) {
    px[i] = alpha * px[i];
}

Now memory layout is:

x0 x1 x2 ... xN-1 y0 y1 y2 ... z0 z1 z2 ...

Vector loads can operate cleanly on px without touching unrelated data.

Strategy: For performance-critical inner loops, consider SoA (or “AoSoA” / blocked layouts) instead of AoS.
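
To make the AoSoA idea concrete, here is a minimal sketch; the block width W and the PointBlock type are invented for this illustration:

c
#define W 8   /* block width; e.g. one AVX-512 register holds 8 doubles */

typedef struct {
    double x[W];
    double y[W];
    double z[W];
} PointBlock;   /* hypothetical AoSoA container for this sketch */

/* Scale the x components. The inner loop runs over a contiguous x[W]
   array, so it vectorizes like the SoA version, while each block still
   keeps the x, y, z of the same points close together in memory. */
void scale_x(PointBlock *blocks, int nblocks, double alpha) {
    for (int b = 0; b < nblocks; b++) {
        for (int i = 0; i < W; i++) {
            blocks[b].x[i] = alpha * blocks[b].x[i];
        }
    }
}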

Contiguous and Aligned Access

Vector loads are most efficient when the data is accessed contiguously (unit stride) and, ideally, aligned to the vector register width.

Avoid large strides:

c
// Stride access (harder to vectorize efficiently)
for (int i = 0; i < N; i++) {
    y[i] = a * x[i * stride];
}

If stride is > 1, the compiler may still vectorize using gather instructions, but these are slower than contiguous loads.

Prefer:

c
for (int i = 0; i < N; i++) {
    y[i] = a * x[i];
}

Strategy: Organize arrays so that the “looped-over” index is the fastest varying dimension in memory.

For example, in C:

c
double A[n][m];    // row-major: A[i][j] is contiguous in j
// Good: j in inner loop
for (int i = 0; i < n; i++)
  for (int j = 0; j < m; j++)
    A[i][j] = ...;
// Worse: i in inner loop (non-contiguous)
for (int j = 0; j < m; j++)
  for (int i = 0; i < n; i++)
    A[i][j] = ...;

Loop Transformations for Vectorization

Many loops aren’t naturally vectorizable but can be transformed to make them so.

1. Removing (or Isolating) Dependencies

A loop-carried dependency means an iteration uses results from a previous iteration. This blocks vectorization.

Example (not vectorizable as-is):

c
for (int i = 1; i < n; i++) {
    a[i] = a[i-1] + b[i];
}

Each a[i] depends on a[i-1]. This is essentially a prefix sum (scan) and not straightforwardly vectorizable.

But sometimes dependencies are false (an artifact of how we wrote the code), not real:

c
// Not obviously safe to vectorize (compiler may be conservative)
for (int i = 0; i < n; i++) {
    a[i] = a[i] + b[i];
}

Here there is no real inter-iteration dependency, but the compiler might be cautious because a and b could alias (point into overlapping memory). You can help the compiler with the restrict qualifier, by keeping inputs and outputs in clearly separate arrays, or with a directive asserting independence.

Strategy: Distinguish real dependencies, which need algorithmic changes (a prefix sum, for instance, needs a scan algorithm), from false dependencies caused by aliasing or code structure, which you can remove with restrict or by restructuring the loop.
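
For example, a minimal sketch of using restrict to rule out aliasing (the function name is illustrative):

c
/* restrict promises the compiler that x and y do not overlap,
   so it can vectorize without emitting runtime overlap checks. */
void axpy(int n, double a, const double * restrict x, double * restrict y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}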

2. Loop Fission (Splitting Loops)

Combining too much work in one loop can introduce conditions that block vectorization. Splitting (fission) can help.

Instead of:

c
for (int i = 0; i < n; i++) {
    a[i] = b[i] * c[i];
    if (a[i] > t) d[i] = a[i];
}

Split into:

c
for (int i = 0; i < n; i++) {
    a[i] = b[i] * c[i];
}
for (int i = 0; i < n; i++) {
    if (a[i] > t) d[i] = a[i];
}

The first loop is a simple multiply and very vectorizable. The second may also vectorize (using masked operations) on modern compilers and architectures.

3. Loop Fusion

Sometimes combining loops helps by improving data locality and amortizing loop overhead. Fusion can help vectorization if it reduces memory passes and keeps data in cache.

Before:

c
for (int i = 0; i < n; i++) {
    x[i] = a * x[i];
}
for (int i = 0; i < n; i++) {
    y[i] = x[i] + y[i];
}

After (fused):

c
for (int i = 0; i < n; i++) {
    x[i] = a * x[i];
    y[i] = x[i] + y[i];
}

Both versions can vectorize, but the fused loop makes a single pass over x and y instead of two, so x[i] is still in registers or cache when it is reused, and the loop overhead is paid only once.

Strategy: Use fission to isolate clean vector loops; use fusion to increase work per pass over memory. Check which direction helps most in practice.

4. Loop Unrolling

Compilers often unroll loops to fill vector registers and hide latencies. Sometimes manual unrolling helps, especially when compilers are conservative.

Example (conceptual unroll by factor 4):

c
for (int i = 0; i < n; i += 4) {   // assumes n is a multiple of 4; otherwise add a scalar remainder loop
    y[i]   = a * x[i]   + y[i];
    y[i+1] = a * x[i+1] + y[i+1];
    y[i+2] = a * x[i+2] + y[i+2];
    y[i+3] = a * x[i+3] + y[i+3];
}

Compilers typically do this when auto-vectorizing, but for very hot loops, testing a tuned unrolled version can be worthwhile.

5. Loop Tiling / Blocking

Tiling focuses on cache usage, but it also affects vectorization: keeping the working set in cache keeps the vector units fed with data, and the small, dense inner loops that tiling exposes are natural vectorization targets.

Example: matrix-matrix multiply is often blocked into sub-tiles where the central micro-kernel is heavily vectorized. Libraries like BLAS are built around this idea.

Strategy: Tiling for cache often brings natural vectorizable kernels; aim to vectorize those inner kernels.
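
As a rough sketch of what blocking looks like in code (the tile size TILE and the naive loop order are for illustration only, not a tuned kernel):

c
#define TILE 64   /* illustrative tile size; real values are tuned per machine */

/* Naive blocked matrix multiply, C += A * B, all n x n, row-major.
   The innermost j-loop walks C[i][j] and B[k][j] contiguously,
   so it is the natural target for vectorization. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}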

Handling Conditionals and Branches

Branches inside loops can hurt vectorization by creating divergent control flow. SIMD prefers uniform work across vector lanes.

Use Predication (Masked Operations)

Modern ISAs (like AVX-512) support masked (predicated) operations, which let the compiler handle conditional loops like this one without branching per element:

c
for (int i = 0; i < n; i++) {
    if (x[i] > 0.0) {
        y[i] = x[i];
    } else {
        y[i] = 0.0;
    }
}

Compilers may turn this into a branchless vector sequence: compute both outcomes for all lanes, then select the correct value per lane with a mask or blend.

To help, you can sometimes write branch-free code:

c
for (int i = 0; i < n; i++) {
    double xi = x[i];
    y[i] = (xi > 0.0) ? xi : 0.0;
}

Or use math idioms (when appropriate):

c
y[i] = fmax(x[i], 0.0);

Libraries and compilers often map functions like fmax to efficient vector instructions.

Separate Hot and Cold Paths

If only a few elements need special handling, split them out:

c
// Common path (vectorizable)
for (int i = 0; i < n; i++) {
    if (likely_condition(x[i])) {
        y[i] = fast_path(x[i]);
    }
}
// Rare path, process scalar or with a separate loop
for (int i = 0; i < n; i++) {
    if (!likely_condition(x[i])) {
        y[i] = slow_path(x[i]);
    }
}

The idea is to make the main loop as uniform and branch-free as possible.

Using Compiler Directives and Flags

Auto-vectorization is controlled heavily by compiler flags and pragmas.

Enable Vectorization with Optimization Flags

Common examples (exact flags depend on your compiler): -O2 or -O3 to enable the auto-vectorizer, -march=native (GCC/Clang) or -xHost (Intel) to target the full instruction set of the build machine, and -ffast-math or similar to permit reassociation of floating-point operations.

Be careful with flags that relax floating-point rules (like -ffast-math): they can change numerical behavior.

Pragmas / Directives to Guide Vectorization

Compilers often support directives like #pragma omp simd (portable, part of OpenMP), #pragma GCC ivdep or #pragma ivdep (“ignore vector dependencies” hints, compiler-specific), and various vendor-specific vectorization pragmas.

Example:

c
#pragma omp simd
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

#pragma omp simd tells the compiler that the loop’s iterations may be executed in SIMD fashion: you are asserting that there are no loop-carried dependencies or other hazards that make this unsafe.

This is powerful but dangerous if misused: only apply it when you are sure the loop is safe.

Strategy: Rely on optimization flags and clean loop structure first; add directives only to hot loops you have verified are dependency-free, preferring portable ones like #pragma omp simd over vendor-specific pragmas.
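
A common safe use is a reduction, where the clause states explicitly how partial results may be combined; a minimal sketch:

c
#include <stddef.h>

/* Dot product with an explicit SIMD reduction: reduction(+:sum) tells the
   compiler it may keep per-lane partial sums and combine them at the end. */
double dot(const double *x, const double *y, size_t n) {
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

With GCC and Clang, compile with -fopenmp (or -fopenmp-simd, which enables only the SIMD directives) for the pragma to take effect.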

Alignment Strategies

Alignment can make vector loads/stores significantly faster for some architectures.

Align Data

For dynamically allocated memory in C/C++:

c
#include <stdlib.h>

double *x;
posix_memalign((void**)&x, 64, n * sizeof(double));  // 64-byte alignment (check the return value in real code)

You can also use alignment attributes:

c
double x[N] __attribute__((aligned(64)));

Or in C++:

c++
alignas(64) double x[N];

Tell the Compiler About Alignment

Some pragmas or builtins let you assert that a pointer is aligned, enabling aligned vector loads:

c
#pragma vector aligned   /* Intel compiler directive; other compilers offer similar hints or builtins */
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

Again: only do this if the pointers really are aligned as promised.
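
With GCC and Clang, the __builtin_assume_aligned builtin serves a similar purpose; a minimal sketch, assuming x and y really were allocated with 64-byte alignment:

c
/* Assert 64-byte alignment so the compiler may use aligned loads/stores.
   This is a promise, not a check: passing unaligned pointers is undefined. */
void scale(int n, double a, const double *x, double *y) {
    const double *xa = __builtin_assume_aligned(x, 64);
    double *ya       = __builtin_assume_aligned(y, 64);
    for (int i = 0; i < n; i++) {
        ya[i] = a * xa[i] + ya[i];
    }
}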

Handling Remainders (“Tail” Loops)

When n is not a multiple of the vector width, the compiler typically emits a vectorized main loop for the bulk of the iterations plus a scalar (or masked) remainder loop for the leftover elements.

You can explicitly structure your loop to make this clear:

c
int i = 0;
int vec_end = n - (n % W);   // W = vector width in elements (e.g., 4 doubles for AVX2)
for (; i < vec_end; i += W) {
    // vectorized body (conceptually)
}
for (; i < n; i++) {
    // scalar cleanup
}

You usually don’t need to write it this way by hand; compilers do it. But understanding it helps you reason about edge cases and performance.

When to Use Intrinsics or Hand-Vectorized Code

Auto-vectorization plus directives covers most cases. Manual intrinsics (e.g., AVX2, AVX-512 intrinsics) are a last resort.

Use intrinsics when the compiler repeatedly fails to vectorize a pattern you know is safe, when you need specific instructions (shuffles, FMAs, masked operations) that it will not generate on its own, or when a well-understood hotspot justifies per-architecture tuning.

Example of an AVX2-style intrinsic kernel (illustrative):

c
#include <immintrin.h>
void saxpy_avx(float a, const float *x, float *y, int n) {
    int i;
    __m256 av = _mm256_set1_ps(a);                // broadcast a into all 8 lanes once
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 xv = _mm256_loadu_ps(&x[i]);       // unaligned load of 8 floats of x
        __m256 yv = _mm256_loadu_ps(&y[i]);       // unaligned load of 8 floats of y
        __m256 res = _mm256_fmadd_ps(av, xv, yv); // a*x + y (fused multiply-add)
        _mm256_storeu_ps(&y[i], res);
    }
    // scalar remainder
    for (; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

Drawbacks: intrinsics tie the code to one instruction set, are harder to read and maintain, and make it easy to introduce subtle bugs (remainder handling, alignment assumptions, availability of FMA).

Strategy:

  1. Start with clean scalar code.
  2. Enable auto-vectorization and use directives.
  3. Only if you still fall short in a clear hotspot, consider a hand-vectorized micro-kernel, ideally hidden behind a portable interface.

Measuring and Verifying Vectorization

Never assume vectorization is happening or helping; always verify.

1. Compiler Reports

Most compilers can print vectorization info: for example, -fopt-info-vec and -fopt-info-vec-missed with GCC, -Rpass=loop-vectorize and -Rpass-missed=loop-vectorize with Clang, and -qopt-report with the Intel compilers.

These reports tell you which loops were vectorized and, for those that were not, the reason the compiler gave up (assumed dependencies, possible aliasing, function calls in the body, unsupported constructs).

Use this feedback to modify the code or add hints.

2. Assembly Inspection

You can inspect the generated assembly (compile with -S, optionally -fverbose-asm, or disassemble with objdump -d) to confirm the presence of vector instructions (vmovapd, vaddps, vfmadd231pd, etc.). This is more advanced but very informative.

3. Performance Counters and Profilers

Use performance tools (covered elsewhere in the course) to measure runtime before and after, count vector versus scalar floating-point instructions, and check whether the loop is actually memory-bound.

If vectorization is successful and beneficial, you should see shorter kernel runtimes and a high fraction of packed vector instructions; if the loop is memory-bound, the instruction mix may change while the runtime barely moves.

Balancing Vectorization with Numerical and Code Constraints

Vectorization is not always “free”: reassociating floating-point operations (especially in reductions) can change rounding and thus results slightly, masked code may still pay for both sides of a branch, and aggressive restructuring can make the code harder to maintain.

Strategies for balancing: keep a scalar reference implementation for correctness checks, restrict value-changing flags like -ffast-math to the files or loops where the accuracy tradeoff is acceptable, document any pragmas you add, and only accept extra code complexity where measurements show a real gain.

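To see why reassociation matters, here is a sketch of what a vectorized sum effectively does: it keeps independent partial sums (four accumulators chosen for illustration) and combines them at the end, which can round differently from the strict left-to-right scalar sum.

c
/* Scalar sum: additions happen strictly left to right. */
double sum_scalar(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

/* What a vectorized sum effectively computes: independent partial sums,
   combined at the end. Mathematically the same, rounding may differ. */
double sum_partial(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++) s += x[i];   /* scalar remainder */
    return s;
}
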
Practical Checklist for Vectorization

For a hot loop you want to optimize:

  1. Enable optimization and vectorization flags appropriate to your compiler and architecture.
  2. Check compiler reports to see if the loop is vectorized; fix issues indicated by the compiler.
  3. Simplify the loop body:
    • Straight-line arithmetic.
    • Avoid complex branches; use masks or branchless forms when practical.
  4. Ensure independence:
    • No loop-carried dependencies.
    • Avoid aliasing, or use restrict / separate arrays.
  5. Optimize data layout:
    • Prefer SoA or blocked layouts for vectorized operations.
    • Ensure contiguous, well-strided access in the inner loop.
  6. Consider alignment and, if justified, use aligned allocation + hints.
  7. Use pragmas/directives (#pragma omp simd, etc.) once you know a loop is safe.
  8. Re-measure performance and check correctness against a reference.
  9. Only then, consider intrinsics if auto-vectorization still isn’t good enough.

Applied systematically, these strategies let you exploit modern vector units effectively in HPC codes without making the code base unmanageable.
