Understanding Vectorization in Practice
In this chapter, we focus on how to use vectorization effectively in real HPC codes, assuming you already know the basic idea of SIMD and vector instructions from earlier chapters.
Vectorization strategies are about:
- Restructuring code so that compilers (or you, via intrinsics) can use SIMD.
- Organizing data so that vector loads/stores are efficient.
- Avoiding patterns that prevent vectorization or make it ineffective.
- Checking that vectorization actually happens and helps.
We’ll concentrate on practical patterns and transformations you can apply to your own loops and kernels.
Recognizing Vectorization-Friendly Patterns
Compilers like simple, predictable loops. The easiest targets:
- Straight-line arithmetic on arrays
- Simple `for` loops with:
  - A constant increment (usually `i++` or `i += 1`)
  - Known bounds
  - No complicated `if` branches inside
  - No loop-carried dependencies
Typical good candidate:
```c
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}
```

This type of loop:
- Uses contiguous array elements.
- Has independent iterations.
- Does only arithmetic plus memory loads and stores.
Those are the workhorses of vectorization in HPC.
Data Layout and Memory Access Patterns
Even if the compiler wants to vectorize, the way your data is laid out in memory can help or hurt.
Structure of Arrays (SoA) vs Array of Structures (AoS)
Example AoS (less friendly to vectorization in many cases):
```c
typedef struct {
    double x, y, z;
} Point;

Point p[N];

for (int i = 0; i < N; i++) {
    p[i].x = alpha * p[i].x;
}
```
Here, the x components are interleaved with y and z, so the memory layout is:

```
x0 y0 z0 x1 y1 z1 x2 y2 z2 ...
```
Vector loads want to pull x0 x1 x2 ... as a contiguous chunk.
SoA layout (more vector-friendly):
```c
double px[N], py[N], pz[N];

for (int i = 0; i < N; i++) {
    px[i] = alpha * px[i];
}
```

Now the memory layout is:

```
x0 x1 x2 ... xN-1 y0 y1 y2 ... z0 z1 z2 ...
```
Vector loads can operate cleanly on px without touching unrelated data.
Strategy: For performance-critical inner loops, consider SoA (or “AoSoA” / blocked layouts) instead of AoS.
Contiguous and Aligned Access
Vector loads are most efficient when:
- Data is contiguous in memory (no big strides).
- Data is correctly aligned for the vector width (e.g., 32- or 64-byte alignment).
Avoid large strides:
```c
// Strided access (harder to vectorize efficiently)
for (int i = 0; i < N; i++) {
    y[i] = a * x[i * stride];
}
```

If `stride` is greater than 1, the compiler may still vectorize using gather instructions, but these are slower than contiguous loads.
Prefer:
```c
for (int i = 0; i < N; i++) {
    y[i] = a * x[i];
}
```

Strategy: Organize arrays so that the "looped-over" index is the fastest-varying dimension in memory.
For example, in C:
```c
double A[n][m]; // row-major: A[i][j] is contiguous in j

// Good: j in inner loop
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        A[i][j] = ...;

// Worse: i in inner loop (non-contiguous access)
for (int j = 0; j < m; j++)
    for (int i = 0; i < n; i++)
        A[i][j] = ...;
```

Loop Transformations for Vectorization
Many loops aren’t naturally vectorizable but can be transformed to make them so.
1. Removing (or Isolating) Dependencies
A loop-carried dependency means an iteration uses results from a previous iteration. This blocks vectorization.
Example (not vectorizable as-is):
```c
for (int i = 1; i < n; i++) {
    a[i] = a[i-1] + b[i];
}
```
Each `a[i]` depends on `a[i-1]`. This is essentially a prefix sum (scan) and not straightforwardly vectorizable.
But sometimes dependencies are false (an artifact of how we wrote the code), not real:
```c
// Not obviously safe to vectorize (compiler may be conservative)
for (int i = 0; i < n; i++) {
    a[i] = a[i] + b[i];
}
```

Here there is no real inter-iteration dependency, but the compiler might be cautious because of possible aliasing. You can help the compiler with:
- `restrict` qualifiers in C/C++ to indicate no aliasing.
- Compiler-specific pragmas / directives to assert independence (discussed below).
Strategy:
- Manually factor out true dependencies into separate loops or algorithms.
- Help the compiler recognize that there are no dependencies when that is true.
2. Loop Fission (Splitting Loops)
Combining too much work in one loop can introduce conditions that block vectorization. Splitting (fission) can help.
Instead of:
```c
for (int i = 0; i < n; i++) {
    a[i] = b[i] * c[i];
    if (a[i] > t) d[i] = a[i];
}
```

Split into:

```c
for (int i = 0; i < n; i++) {
    a[i] = b[i] * c[i];
}

for (int i = 0; i < n; i++) {
    if (a[i] > t) d[i] = a[i];
}
```

The first loop is a simple multiply and readily vectorizable. The second may also vectorize (using masked operations) on modern compilers and architectures.
3. Loop Fusion
Sometimes combining loops helps by improving data locality and amortizing loop overhead. Fusion can help vectorization if it reduces memory passes and keeps data in cache.
Before:
```c
for (int i = 0; i < n; i++) {
    x[i] = a * x[i];
}

for (int i = 0; i < n; i++) {
    y[i] = x[i] + y[i];
}
```

After (fused):

```c
for (int i = 0; i < n; i++) {
    x[i] = a * x[i];
    y[i] = x[i] + y[i];
}
```

Both versions can vectorize, but the fused loop:
- Reads and writes `x[i]` only once.
- Might provide better cache utilization and thus better SIMD throughput.
Strategy: Use fission to isolate clean vector loops; use fusion to increase work per pass over memory. Check which direction helps most in practice.
4. Loop Unrolling
Compilers often unroll loops to fill vector registers and hide latencies. Sometimes manual unrolling helps, especially when compilers are conservative.
Example (conceptual unroll by factor 4):
```c
for (int i = 0; i < n; i += 4) {
    y[i]   = a * x[i]   + y[i];
    y[i+1] = a * x[i+1] + y[i+1];
    y[i+2] = a * x[i+2] + y[i+2];
    y[i+3] = a * x[i+3] + y[i+3];
}
```

Note that as written this assumes `n` is a multiple of 4; a real version needs a scalar remainder loop. Compilers typically do this when auto-vectorizing, but for very hot loops, testing a tuned unrolled version can be worthwhile.
5. Loop Tiling / Blocking
Tiling focuses on cache usage, but it also affects vectorization:
- Inner-most, small loops are good vectorization targets.
- Tiling creates such inner loops.
Example: matrix-matrix multiplication is often blocked into sub-tiles where the central micro-kernel is heavily vectorized. Optimized BLAS implementations are built around this idea.
Strategy: Tiling for cache often brings natural vectorizable kernels; aim to vectorize those inner kernels.
Handling Conditionals and Branches
Branches inside loops can hurt vectorization by creating divergent control flow. SIMD prefers uniform work across vector lanes.
Use Predication (Masked Operations)
Modern ISAs (like AVX-512) support masked operations:
```c
for (int i = 0; i < n; i++) {
    if (x[i] > 0.0) {
        y[i] = x[i];
    } else {
        y[i] = 0.0;
    }
}
```

Compilers may turn this into:
- Load the vector `x[i..i+W-1]`.
- Compute a mask where `x > 0`.
- Use a masked store or blend instruction to compute `y`.
To help, you can sometimes write branch-free code:
```c
for (int i = 0; i < n; i++) {
    double xi = x[i];
    y[i] = (xi > 0.0) ? xi : 0.0;
}
```

Or use math idioms (when appropriate):
```c
y[i] = fmax(x[i], 0.0);
```

Libraries and compilers often map functions like `fmax` to efficient vector instructions.
Separate Hot and Cold Paths
If only a few elements need special handling, split them out:
```c
// Common path (vectorizable)
for (int i = 0; i < n; i++) {
    if (likely_condition(x[i])) {
        y[i] = fast_path(x[i]);
    }
}

// Rare path: process scalar or with a separate loop
for (int i = 0; i < n; i++) {
    if (!likely_condition(x[i])) {
        y[i] = slow_path(x[i]);
    }
}
```

The idea is to make the main loop as uniform and branch-free as possible.
Using Compiler Directives and Flags
Auto-vectorization is controlled heavily by compiler flags and pragmas.
Enable Vectorization with Optimization Flags
Common examples (exact flags depend on your compiler):
- GCC / Clang: `-O3 -march=native -ffast-math` (with care), `-ftree-vectorize`
- Intel compilers: `-O3 -xHost` or specific architecture flags
- LLVM-based HPC compilers: similar `-O3 -march=...` options
Be careful with flags that relax floating-point rules (like `-ffast-math`): they can change numerical behavior.
Pragmas / Directives to Guide Vectorization
Compilers often support directives like:
- `#pragma omp simd` (portable, OpenMP-based hint)
- `#pragma ivdep` or `#pragma vector` (compiler-specific)
- `#pragma GCC ivdep`
- `#pragma clang loop vectorize(enable)`
Example:
```c
#pragma omp simd
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}
```
`#pragma omp simd` tells the compiler:
- You promise there are no loop-carried dependencies that break vectorization.
- The compiler should vectorize the loop if at all possible.
This is powerful but dangerous if misused: only apply it when you are sure the loop is safe.
Strategy:
- Start with compiler auto-vectorization at a high optimization level.
- Add `simd`/`vector` pragmas to critical loops only when you've confirmed correctness and want to force or strengthen vectorization.
Alignment Strategies
Alignment can make vector loads/stores significantly faster for some architectures.
Align Data
For dynamically allocated memory in C/C++:
- Use aligned allocators (e.g., `posix_memalign`, C11 `aligned_alloc` (also `std::aligned_alloc` in C++17), or vendor-specific allocation routines).

```c
double *x;
posix_memalign((void**)&x, 64, n * sizeof(double)); // 64-byte alignment
```

You can also use alignment attributes:

```c
double x[N] __attribute__((aligned(64)));
```

Or in C++:

```cpp
alignas(64) double x[N];
```

Tell the Compiler About Alignment
Some pragmas or builtins let you assert that a pointer is aligned, enabling aligned vector loads:
```c
#pragma vector aligned
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}
```

Again: only do this if the pointers really are aligned as promised.
Handling Remainders (“Tail” Loops)
When `n` is not a multiple of the vector width `W`, the compiler typically:
- Vectorizes most of the loop (the first `n / W * W` elements), and
- Handles the remaining `n % W` elements with a scalar "remainder" loop.
You can explicitly structure your loop to make this clear:
```c
int i = 0;
int vec_end = n - (n % W);

for (; i < vec_end; i += W) {
    // vectorized body (conceptually)
}

for (; i < n; i++) {
    // scalar cleanup
}
```

You usually don't need to write it this way by hand; compilers do it for you. But understanding it helps you reason about edge cases and performance.
When to Use Intrinsics or Hand-Vectorized Code
Auto-vectorization plus directives covers most cases. Manual intrinsics (e.g., AVX2, AVX-512 intrinsics) are a last resort.
Use intrinsics when:
- The compiler fails to vectorize a crucial kernel in a satisfactory way.
- You need full control over exact instructions and register usage.
- The kernel is small, extremely hot, and well-isolated.
Example of an AVX2-style intrinsic (illustrative):
```c
#include <immintrin.h>

void saxpy_avx(float a, const float *x, float *y, int n) {
    __m256 av = _mm256_set1_ps(a);  // broadcast a once, outside the loop
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 xv  = _mm256_loadu_ps(&x[i]);
        __m256 yv  = _mm256_loadu_ps(&y[i]);
        __m256 res = _mm256_fmadd_ps(av, xv, yv); // a*x + y
        _mm256_storeu_ps(&y[i], res);
    }
    // scalar remainder
    for (; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Drawbacks:
- Non-portable across architectures.
- Harder to maintain and debug.
- Needs careful tuning per target CPU.
Strategy:
- Start with clean scalar code.
- Enable auto-vectorization and use directives.
- Only if you still fall short in a clear hotspot, consider a hand-vectorized micro-kernel, ideally hidden behind a portable interface.
Measuring and Verifying Vectorization
Never assume vectorization is happening or helping; always verify.
1. Compiler Reports
Most compilers can print vectorization info:
- GCC: `-fopt-info-vec-optimized -fopt-info-vec-missed`
- Clang: `-Rpass=loop-vectorize -Rpass-missed=loop-vectorize`
- Intel: `-qopt-report=5 -qopt-report-phase=vec` (or similar)
These tell you:
- Which loops were vectorized.
- Why some loops could not be vectorized.
- Any assumptions the compiler made.
Use this feedback to modify code or add hints.
2. Assembly Inspection
You can inspect the generated assembly (`objdump -d`, or compiling with `-S` or `-fverbose-asm`) to confirm the presence of vector instructions (`vmovapd`, `vaddps`, etc.). This is more advanced but very informative.
3. Performance Counters and Profilers
Use performance tools (covered elsewhere in the course) to:
- Measure floating-point operations per second.
- Count vector vs scalar instructions.
- Observe memory bandwidth vs compute utilization.
If vectorization is successful and beneficial, you should see:
- Fewer instructions per element.
- Higher FLOP/s (often until memory bandwidth becomes the bottleneck).
Balancing Vectorization with Numerical and Code Constraints
Vectorization is not always “free”:
- Changing loop structure can change floating-point operation order, resulting in slightly different rounding.
- `-ffast-math` and related flags may violate strict IEEE 754 semantics.
- More complex code written to enable vectorization may reduce readability and maintainability.
Strategies for balancing:
- Apply aggressive vectorization only in numerically robust parts of the code.
- Keep a simple, clear reference implementation for verification.
- Use unit tests and regression tests to ensure acceptable numerical behavior.
- Isolate performance-critical, vectorized kernels behind clear interfaces.
Practical Checklist for Vectorization
For a hot loop you want to optimize:
- Enable optimization and vectorization flags appropriate to your compiler and architecture.
- Check compiler reports to see if the loop is vectorized; fix issues indicated by the compiler.
- Simplify the loop body:
  - Straight-line arithmetic.
  - Avoid complex branches; use masks or branchless forms when practical.
- Ensure independence:
  - No loop-carried dependencies.
  - Avoid aliasing, or use `restrict` / separate arrays.
- Optimize data layout:
  - Prefer SoA or blocked layouts for vectorized operations.
  - Ensure contiguous, well-strided access in the inner loop.
  - Consider alignment and, if justified, use aligned allocation plus hints.
- Use pragmas/directives (`#pragma omp simd`, etc.) once you know a loop is safe.
- Re-measure performance and check correctness against a reference.
- Only then consider intrinsics, if auto-vectorization still isn't good enough.
Applied systematically, these strategies let you exploit modern vector units effectively in HPC codes without making the code base unmanageable.