
Vectorization strategies

From Scalar Code to Vectorized Code

Vectorization strategies are about turning scalar operations on single values into operations on short vectors of values that execute in one instruction. The performance chapter already introduced why this can be important. Here the focus is how to actually make your code vectorize in practice, and how to recognize when it does or does not.

Modern CPUs have SIMD instruction sets such as SSE, AVX, AVX2, AVX-512, or ARM NEON and SVE. These instructions operate on registers that hold several numbers at once. For example, with AVX2 a 256-bit register can hold four double precision values or eight single precision values. A single SIMD add can therefore add four double precision values in the time a scalar add would add only one.

The compiler will try to use these SIMD instructions automatically when it can prove that it is safe and profitable. Your vectorization strategy is to write code and use compiler options in a way that makes this automatic vectorization easier and more effective, and to know when and how to intervene manually.

Key idea: Write code that is easy for the compiler to prove safe for SIMD, use appropriate compiler flags, and check reports or assembly to verify that vectorization actually happens.

Automatic Vectorization by the Compiler

Most compilers used in HPC have powerful automatic vectorizers. For C, C++, and Fortran, they scan loops and attempt to transform them into SIMD loops. The exact flags and behavior differ between compilers, but the general principles are similar.

A classic vectorization candidate is a simple loop that applies the same operation independently to each element of an array, for example:

for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c[i];
}

If the compiler can be sure that a, b, and c do not overlap in memory, and there are no data dependences across iterations, it can emit a SIMD loop. The relevant compiler flags are covered in detail elsewhere; here it is enough to know that higher optimization levels and specific vectorization options are required. Examples are -O3 for GCC (which enables -ftree-vectorize), -xHost for the Intel compilers (optionally with -qopt-zmm-usage=high to encourage 512-bit registers), and, used with care, -ffast-math if you can accept relaxed floating point semantics.

Compilers can emit detailed vectorization reports. For example, GCC can produce such a report with options like -fopt-info-vec-optimized and -fopt-info-vec-missed. These messages tell you which loops were vectorized and why some loops were not. Reading these reports is a core part of a vectorization strategy, because it points directly to obstacles in your code.

Rule: Always verify vectorization with compiler reports or tools. Never assume a loop is vectorized just because it looks vectorizable.

Conditions for Safe Vectorization

Vectorization is only legal if the semantics of the program do not change. The compiler needs to prove that no iteration of a loop depends on the result of another iteration in a way that would be violated by executing several iterations in parallel.

The main categories of constraints the compiler checks are:

Data dependencies across iterations. If a statement in iteration $i$ uses a value written in iteration $i-1$, the compiler cannot simply reorder or group iterations. For example:

for (int i = 1; i < n; ++i) {
    a[i] = a[i-1] + 1.0;
}

This loop has a loop-carried dependence on a[i-1], so a straightforward SIMD transformation would change the result. Such loops are usually not vectorizable unless they can be transformed mathematically, for example into a prefix-sum style computation handled by special algorithms.

Pointer aliasing. In C and C++, pointers may refer to overlapping memory regions. If the compiler suspects that a and b might overlap, it must be conservative. The following loop might not be vectorized:

void saxpy(int n, float *a, float *b, float alpha) {
    for (int i = 0; i < n; ++i) {
        a[i] = alpha * a[i] + b[i];
    }
}

If a and b point to the same memory region, the semantics change when loading and storing in a different order. You can help the compiler using language features like restrict in C, which declares that pointers do not alias:

void saxpy(int n, float * __restrict a,
           const float * __restrict b,
           float alpha) {
    for (int i = 0; i < n; ++i) {
        a[i] = alpha * a[i] + b[i];
    }
}

This tells the compiler that a and b do not overlap, which often enables vectorization.

Control flow inside loops. Branches like if, break, and continue inside a loop make vectorization more complex. Compilers sometimes transform the branches into vector masks, but if the branch conditions are irregular or depend on earlier iterations, they may give up.

Function calls and unknown side effects. A loop that calls a function that might have side effects or interact with global state is difficult to vectorize. The compiler must know that the function is pure or has limited side effects. Inlining or using attributes that describe purity can help.
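As a small sketch of this (names here are illustrative), a helper marked static inline lets the compiler see the complete loop body and verify that there are no hidden side effects:

```c
#include <assert.h>
#include <stddef.h>

/* A small helper marked static inline so the compiler can inline it
   into the loop body and prove that it has no side effects. */
static inline double squared(double v) {
    return v * v;
}

void square_all(size_t n, const double *in, double *out) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = squared(in[i]);   /* after inlining: straight-line code */
    }
}
```

With the call inlined, the loop body reduces to a load, a multiply, and a store, which is exactly the shape automatic vectorizers handle best.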

Floating point behavior. Strict IEEE floating point semantics disallow some reorderings that vectorization might introduce. If your application can tolerate slight changes in rounding, you can enable more aggressive vectorization with flags that relax floating point rules, such as -ffast-math in GCC or similar options in other compilers. This must be a considered decision, especially for numerically sensitive codes.

Rule: For a loop to vectorize, there must be no loop-carried dependencies, no unsafe aliasing, and no side effects that depend on the exact order of iterations.

Writing Vectorization-Friendly Loops

Your primary tool as an HPC developer is loop structure. Many vectorization strategies consist of rewriting loops into simpler, more regular shapes.

Keep loop bodies simple. Prefer a single, straight-line sequence of arithmetic and loads and stores inside the loop. Move invariant calculations outside the loop, and avoid function calls that are not clearly inlineable. For example, instead of:

for (int i = 0; i < n; ++i) {
    double scale = exp(alpha * t[i]);
    x[i] = x[i] * scale;
}

note that scale here depends on t[i], so it is not loop invariant and cannot simply be hoisted. The practical options are to precompute the scale values in a separate pass, ideally with a vectorized math library, or to restructure the data so that expensive results can be reused, leaving the hot loop as straight-line arithmetic.

Simplify control flow. Replace complex if statements with conditional expressions that compilers can implement via masks. For instance, instead of:

for (int i = 0; i < n; ++i) {
    if (x[i] > 0.0) {
        y[i] = x[i];
    } else {
        y[i] = 0.0;
    }
}

you can often write:

for (int i = 0; i < n; ++i) {
    y[i] = (x[i] > 0.0) ? x[i] : 0.0;
}

This is still a conditional, but many compilers can turn it into vectorized select operations.

Expose contiguous memory access. Vector units are most efficient when each lane accesses consecutive elements in memory. Choose between array-of-structs and struct-of-arrays layouts deliberately: in many numerical codes, struct-of-arrays layouts are better for vectorization. For example, instead of:

typedef struct {
    double x;
    double y;
    double z;
} Point;
Point p[n];
for (int i = 0; i < n; ++i) {
    p[i].x = p[i].x + alpha;
}

you can use three separate arrays:

double px[n], py[n], pz[n];
for (int i = 0; i < n; ++i) {
    px[i] = px[i] + alpha;
}

This allows the compiler to load a vector of px values with a single SIMD load, rather than gathering from scattered struct fields.

Minimize loop-carried state. Shift variables that depend on the previous iteration into prefix sums or separate phases when possible. For example, if you can rewrite a recurrence into an independent formula, it often becomes vectorizable.
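For instance, the earlier recurrence a[i] = a[i-1] + 1.0 has the closed form a[i] = a[0] + i, which removes the loop-carried dependence entirely (a sketch, names illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* The recurrence a[i] = a[i-1] + 1.0 rewritten as its closed form
   a[i] = a[0] + i: every iteration is now independent, so the loop
   is a straightforward vectorization candidate. */
void fill_sequential(size_t n, double a0, double *a) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = a0 + (double)i;   /* no dependence on a[i-1] */
    }
}
```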

Alignment and Memory Layout Considerations

Vector instructions often work best when data is aligned to specific byte boundaries, such as 16 bytes for SSE or 32 bytes for AVX. Misaligned data can still be loaded, but it can incur penalties or require multiple instructions.

Modern compilers and CPUs are more tolerant of misalignment than older ones, but for performance critical kernels it can still matter. Common strategies include aligning dynamically allocated arrays using appropriate allocation functions and informing the compiler through attributes or pragmas that an array is aligned.

For example, in C you might allocate aligned memory and then use that in a loop:

#include <stdlib.h>

double *a;
/* Request 32-byte alignment; posix_memalign returns nonzero on failure. */
if (posix_memalign((void **)&a, 32, n * sizeof(double)) != 0) {
    /* handle allocation failure */
}
for (int i = 0; i < n; ++i) {
    a[i] = 0.0;
}

In some compilers you can add hints like:

/* aligned_alloc requires the size to be a multiple of the alignment */
double * __restrict a = aligned_alloc(32, n * sizeof(double));
#pragma omp simd aligned(a:32)
for (int i = 0; i < n; ++i) {
    a[i] = 0.0;
}

The explicit alignment clause tells the compiler it may use aligned SIMD loads and stores.

Memory layout also interacts with stride. A stride of 1 in array index is ideal. Strides greater than 1 may still be vectorizable but can generate more complex gather and scatter operations, which are slower and less predictable. Where possible, reorganize data structures so that critical loops operate on unit stride arrays.

Rule: Use contiguous, aligned arrays for performance critical vectorized loops, and avoid indirect or strided accesses inside hot loops whenever possible.

Combining OpenMP and Vectorization

OpenMP is often used in HPC codes. Besides thread level parallelism, OpenMP provides explicit SIMD constructs that guide the compiler to vectorize loops and describe safety conditions.

The #pragma omp simd directive can request that a loop be vectorized even if the compiler is not certain by itself. You, the programmer, assert that the loop has no dependences that would break semantics, or you tell the compiler how to handle reductions and other features. For example:

#pragma omp simd
for (int i = 0; i < n; ++i) {
    a[i] = b[i] * c[i];
}

You can also express reductions:

double sum = 0.0;
#pragma omp simd reduction(+:sum)
for (int i = 0; i < n; ++i) {
    sum += a[i];
}

Here the compiler can generate SIMD reduction operations inside the loop.

OpenMP allows clauses like aligned, linear, and safelen to give more information. For example:

#pragma omp simd aligned(a,b:32) safelen(8)
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + 1.0;
}

The aligned clause tells the compiler about alignment, and safelen limits the vector length for correctness when there are subtle dependencies that only allow a limited degree of parallelism.

OpenMP also allows nested use of thread parallelism and SIMD parallelism. A common pattern is to use #pragma omp parallel for for distributing iterations across cores, and an inner #pragma omp simd for vectorizing computation inside each thread. Correct use of such constructs is an effective vectorization strategy on modern multicore CPUs.
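A minimal sketch of this combined pattern, assuming an OpenMP-capable compiler (build with -fopenmp; without it the pragma is ignored and the loop simply runs serially):

```c
#include <assert.h>
#include <stddef.h>

/* Threads across cores and SIMD lanes within each thread, expressed
   with the combined parallel-for-simd construct. */
void axpy(size_t n, double alpha, const double *x, double *y) {
    #pragma omp parallel for simd
    for (long i = 0; i < (long)n; ++i) {
        y[i] += alpha * x[i];
    }
}
```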

Manual Intrinsics and Low-Level Vectorization

Sometimes compilers cannot exploit all vectorization opportunities automatically. In performance critical kernels that dominate total runtime, you might drop down to SIMD intrinsics to control vectorization directly. Intrinsics are C or C++ functions that map nearly one to one to specific SIMD instructions.

For example, AVX2 intrinsics for double precision vector addition look like this:

#include <immintrin.h>
void add(int n, const double *a, const double *b, double *c) {
    int i;
    /* Main SIMD loop: process four doubles per iteration. */
    for (i = 0; i <= n - 4; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);   /* unaligned 256-bit load */
        __m256d vb = _mm256_loadu_pd(&b[i]);
        __m256d vc = _mm256_add_pd(va, vb);
        _mm256_storeu_pd(&c[i], vc);
    }
    /* Scalar remainder loop for the last n % 4 elements. */
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

Here _mm256_loadu_pd and _mm256_add_pd directly encode AVX2 operations. This style gives very fine control but ties your code to a specific instruction set and reduces portability and maintainability. The recommended strategy is to start with auto vectorization, use compiler hints and OpenMP SIMD directives, and resort to intrinsics only for a small number of hot kernels after careful profiling.

When intrinsics are used, a typical pattern is to separate them into specialized source files or functions, encapsulate them behind a clean interface, and optionally provide multiple implementations for different instruction sets that are chosen at compile time or run time.
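One way to sketch such multi-versioning is compile-time dispatch on the predefined __AVX2__ macro; real projects often add run-time CPU feature checks on top of this (function names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Portable scalar fallback, always available. */
static void add_scalar(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Public entry point behind a clean interface: the AVX2 kernel is
   compiled only when the compiler targets AVX2 (e.g. -mavx2 or
   -march=native); otherwise the scalar version is used everywhere. */
void add_dispatch(size_t n, const double *a, const double *b, double *c) {
#ifdef __AVX2__
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);
        __m256d vb = _mm256_loadu_pd(&b[i]);
        _mm256_storeu_pd(&c[i], _mm256_add_pd(va, vb));
    }
    add_scalar(n - i, a + i, b + i, c + i);   /* remainder elements */
#else
    add_scalar(n, a, b, c);
#endif
}
```

Callers see only add_dispatch, so the intrinsics stay confined to one translation unit and the scalar path doubles as a reference implementation for testing.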

Rule: Reserve manual intrinsics for critical kernels after exhausting higher level vectorization options, and always guard them with careful testing and possibly fallback implementations.

Dealing with Conditionals and Irregular Code

Real applications often have conditionals, indirect addressing, or data-dependent control flow inside loops. These features make vectorization more difficult, but there are strategies to mitigate the problem.

A common idea is to separate regular and irregular parts. For example, if only some elements require special handling, you can first process all elements in a simple vectorized loop and then handle the few special cases in a separate scalar pass. Alternatively, you can build a list of indices that satisfy a condition and then process that list in a second loop.
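A sketch of the index-list approach (the "special case" condition and all names are illustrative): the main pass stays branch-free and vectorizable, while a compact index list drives a separate scalar pass over the rare cases:

```c
#include <assert.h>
#include <stddef.h>

/* Scale all elements, then zero out the (assumed rare) negative ones.
   idx must have room for up to n indices. */
void scale_and_clamp(size_t n, double *x, size_t *idx) {
    size_t m = 0;
    /* Gather indices of the special cases. */
    for (size_t i = 0; i < n; ++i)
        if (x[i] < 0.0)
            idx[m++] = i;
    /* Main pass: simple, branch-free, vectorizable. */
    for (size_t i = 0; i < n; ++i)
        x[i] *= 2.0;
    /* Scalar pass over the few special cases only. */
    for (size_t k = 0; k < m; ++k)
        x[idx[k]] = 0.0;
}
```

This pays off when the special cases are genuinely rare; if most elements take the special path, a masked or branchless formulation is usually the better choice.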

Mask-based vectorization is another concept. Modern SIMD instruction sets like AVX-512 support explicit predicate masks that determine which lanes are active. Compilers can generate these masks for conditional assignments, but you can also manually reorganize loops to express computations in a mask-friendly way, such as replacing branches with select operations when that does not threaten numerical stability.

Sometimes you can perform loop splitting. If a loop has a boundary condition that affects only a small number of iterations, you can peel off the boundary cases. For example, rather than writing:

for (int i = 0; i < n; ++i) {
    if (i == 0) {
        a[i] = f_boundary();
    } else {
        a[i] = f_interior(a[i-1], a[i]);
    }
}

you may treat the first iteration separately:

a[0] = f_boundary();
for (int i = 1; i < n; ++i) {
    a[i] = f_interior(a[i-1], a[i]);
}

Although the remaining loop still has a dependence on a[i-1] that may prevent vectorization, similar transformations can help in more complex cases where dependencies are local to just a few iterations.

Loop reordering is also helpful. For nested loops, choosing the inner loop to be the one with simple, independent computation on arrays often increases vectorization opportunities. For example, in a 2D array a[i][j], ensuring that j is the inner index in memory layout and in the loop order is usually beneficial.
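A small illustration for a row-major C array (the dimensions are arbitrary): elements of a[i][j] are contiguous in j, so keeping j as the inner loop gives unit-stride access:

```c
#include <assert.h>
#include <stddef.h>

#define NI 4
#define NJ 8

/* Row-major layout: a[i][j] and a[i][j+1] are adjacent in memory.
   With j innermost, the inner loop runs at stride 1; swapping the
   loops would make the inner stride NJ elements. */
void scale_matrix(double a[NI][NJ], double alpha) {
    for (size_t i = 0; i < NI; ++i)
        for (size_t j = 0; j < NJ; ++j)   /* inner loop: unit stride */
            a[i][j] *= alpha;
}
```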

Measuring and Tuning Vectorization Performance

Vectorization strategies must be validated. It is not enough to observe that the compiler claims a loop was vectorized. You should measure the impact on performance, and sometimes inspect generated machine code.

Some practical steps are:

Use compiler vectorization reports. Look for messages such as "loop vectorized" and "loop not vectorized" followed by a reason, and address the reported obstacles, such as aliasing, loop-carried dependences, or an unprofitable cost model.

Inspect hot loops with tools that show assembly or microarchitectural behavior. Many profilers can indicate whether code uses vector instructions and how many instructions per cycle you are achieving.

Perform targeted microbenchmarks. For a given core kernel, run tests with and without vectorization flags to measure speedups. Vectorization alone can often give speedups of two to four times or more, but the exact gain depends on memory bandwidth and other factors.
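A minimal timing sketch along these lines, assuming POSIX clock_gettime (all names are illustrative); taking the best of several runs filters out scheduling noise:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Monotonic wall-clock timer; on non-POSIX platforms substitute an
   equivalent high-resolution clock. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Example kernel to be benchmarked with and without vectorization flags. */
static void double_all(size_t n, double *x) {
    for (size_t i = 0; i < n; ++i)
        x[i] *= 2.0;
}

/* Run the kernel reps times and return the best single-run time. */
double benchmark(void (*kernel)(size_t, double *),
                 size_t n, double *data, int reps) {
    double best = 1e30;
    for (int r = 0; r < reps; ++r) {
        double t0 = now_seconds();
        kernel(n, data);
        double t1 = now_seconds();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Compiling the same kernel once with and once without vectorization flags and comparing the reported times gives a direct measurement of the SIMD speedup for that kernel.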

Be aware of memory bound versus compute bound behavior. If a kernel is limited by memory bandwidth rather than CPU arithmetic, adding more vector arithmetic might not help once loads and stores saturate the memory subsystem. In such cases, vectorization still helps reduce overhead, but improvements may be smaller than the theoretical SIMD width.

Pay attention to vector length and portability. Some compilers can generate code that adapts to different SIMD widths, either through vector-length-agnostic code (as in ARM SVE) or by emitting multiple versions of a loop and selecting among them at run time. When writing portable code, avoid hard coding assumptions like "the vector length is 4." Instead, trust the compiler or use abstractions that hide the vector width, such as high level math libraries, compiler directives, or C++ template libraries for SIMD.

Statement: Effective vectorization is an empirical process. Combine static analysis of loops, compiler reports, and run time profiling to guide your changes and confirm real performance gains.

Interaction with Other Optimizations

Vectorization rarely stands alone. It interacts with many other optimizations that are covered elsewhere, including cache optimization, tiling, and multi-threading. Here it is important to remember that changes to data layout or loop ordering made for cache friendliness can either help or hinder vectorization.

For example, tiling a loop nest to improve cache reuse may change stride patterns and vectorization opportunities. A practical strategy is to start with a simple, well vectorized inner loop, then gradually add tiling and parallelism, while checking at each step that vectorization is still applied and that performance improves or at least does not regress.

Loop unrolling is sometimes used automatically by compilers to improve pipeline utilization and to help vectorization. Manual unrolling is usually not necessary when compilers are configured correctly, but you should understand that explicit unrolling can sometimes confuse auto vectorizers if it introduces complex index expressions.

Function inlining is also important. Small functions inside loops should often be inlined in order for the compiler to see the full loop body and vectorize it. Use of explicit inlining hints or simply enabling higher optimization levels helps.

Summary of Practical Strategies

Effective vectorization strategies for HPC workloads follow a practical pattern. Identify hot numeric loops through profiling. Simplify those loops by removing unnecessary control flow and dependencies, and by ensuring they operate on contiguous, aligned arrays. Use language features to tell the compiler about aliasing and alignment. Enable aggressive vectorization options, examine compiler vectorization reports, and address the specific reasons a loop is not vectorized. Use OpenMP SIMD directives to guide the compiler, and consider manual intrinsics only for a small set of critical kernels where auto vectorization cannot reach expected performance. Always validate any change by measuring its impact on run time and numerical behavior.

With these strategies, vectorization becomes a systematic tool in your broader performance optimization effort, rather than an ad hoc trick.
