What SIMD Means in Practice
SIMD stands for Single Instruction, Multiple Data. It describes hardware that applies the same operation to many data elements in parallel.
At the lowest level, a SIMD unit operates on vectors of data:
- A scalar operation `c = a + b` adds one pair of numbers.
- A SIMD vector operation `C = A + B` adds multiple pairs of numbers in one instruction.
- If the SIMD width is 4 for `float`, a single instruction computes:
$$
C_0 = A_0 + B_0, \quad
C_1 = A_1 + B_1, \quad
C_2 = A_2 + B_2, \quad
C_3 = A_3 + B_3
$$
On modern CPUs this is implemented by:
- Wide registers (e.g., 128-bit, 256-bit, 512-bit)
- Vector instructions that operate on multiple numbers at once
This is data-level parallelism inside a single core, distinct from multi-core or multi-node parallelism.
SIMD Width and Data Types
The number of elements processed in parallel depends on:
- The SIMD register width (e.g., 128-bit, 256-bit, 512-bit)
- The data type size (e.g., 32-bit floats, 64-bit doubles, 32-bit ints)
Example for a 256-bit SIMD register:
- 256 / 32 = 8 single-precision floats processed at once
- 256 / 64 = 4 double-precision values (doubles) processed at once
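As a small illustration of this arithmetic (a sketch; the widths are just illustrative constants, and the `sizeof` values assume typical platforms):

```c
#include <stdio.h>

int main(void) {
    // Common SIMD register widths in bits (SSE, AVX, AVX-512)
    int widths[] = {128, 256, 512};
    for (int i = 0; i < 3; i++) {
        int w = widths[i];
        printf("%3d-bit register: %d floats or %d doubles per instruction\n",
               w, w / (8 * (int)sizeof(float)), w / (8 * (int)sizeof(double)));
    }
    return 0;
}
```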
Typical desktop/server SIMD extensions:
- SSE: 128-bit
- AVX: 256-bit
- AVX-512: 512-bit
- ARM NEON: usually 128-bit
- ARM SVE: scalable width (e.g., 128–2048 bits, implementation-dependent)
For HPC, wider vectors generally mean higher peak throughput, but also stricter requirements on data layout and alignment.
Vectorization vs SIMD
- SIMD: hardware capability (vector registers + vector instructions)
- Vectorization: transforming code so it uses SIMD hardware
Vectorization can be:
- Automatic (auto-vectorization): the compiler converts loops to use SIMD instructions
- Manual:
- Using intrinsics (functions that map directly to SIMD instructions)
- Using libraries that are already vectorized
- Giving hints/directives to the compiler (e.g., `#pragma omp simd`)
Vectorization is about rewriting or annotating code so the CPU can apply one instruction to many data elements.
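As a small sketch of this on a concrete loop (the pragma assumes OpenMP SIMD support, e.g., compiling with `-fopenmp-simd` or `-fopenmp`; without the pragma, an optimizing compiler at `-O2`/`-O3` may auto-vectorize the loop on its own):

```c
// y[i] = a * x[i] + y[i]  -- a classic SIMD-friendly kernel (saxpy)
void saxpy(float * restrict y, const float * restrict x, float a, int n) {
    #pragma omp simd          // explicit hint: this loop is safe to vectorize
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```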
Typical SIMD-Friendly Patterns
The most SIMD-friendly patterns are:
- Simple loops with independent iterations, for example:

```c
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}
```

Each iteration uses `a[i]`, `b[i]`, and `c[i]` only; iterations do not depend on each other.
Common HPC kernels that often vectorize well:
- Element-wise vector operations: addition, multiply, fused multiply-add
- Dot products and simple reductions (with some extra handling; see the sketch after this list)
- Stencil operations with regular neighbor accesses (if carefully written)
- Matrix-vector multiply and many linear algebra kernels
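For example, the dot-product reduction mentioned above can be written so the compiler knows how to vectorize the accumulation (a sketch, assuming OpenMP SIMD support such as `-fopenmp-simd`):

```c
// Dot product with an explicit SIMD reduction hint.
float dot(const float * restrict a, const float * restrict b, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)   // partial sums per lane, combined at the end
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```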
Speedup Potential and Limits
If a SIMD unit processes $W$ elements per instruction, the ideal speedup from vectorization is up to $W\times$ for the vectorized part of the code.
Example:
- 256-bit SIMD with single-precision floats: $W = 8$
- A loop dominated by arithmetic on arrays of floats may (ideally) run up to 8× faster when fully vectorized.
In practice, actual speedup is limited by:
- Memory bandwidth (can you feed data to SIMD fast enough?)
- Instruction mix (non-vectorizable operations, branches)
- Overheads in handling remainders (loop tails not divisible by $W$)
- Imperfect auto-vectorization
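A rough way to quantify these limits is an Amdahl-style estimate: assume a fraction $f$ of the runtime is vectorized and speeds up by exactly $W$, while the rest is unchanged. The overall speedup is then

$$
S = \frac{1}{(1 - f) + \dfrac{f}{W}}
$$

For example, with $f = 0.9$ and $W = 8$, $S \approx 4.7$, well below the ideal $8\times$.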
Conditions for Successful Vectorization
For a compiler to safely vectorize a loop, several conditions must hold:
No Loop-Carried Dependencies
Loop iterations must be independent; that is, later iterations must not depend on results produced by earlier iterations.
Example (not vectorizable as-is):

```c
for (int i = 1; i < n; i++) {
    a[i] = a[i] + a[i-1]; // depends on previous element
}
```

Example (vectorizable):

```c
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i]; // no cross-iteration dependence
}
```

If there are dependencies, some can be handled with more advanced vectorization techniques, but those are harder and less universal.
No Unsafe Memory Aliasing
The compiler must be sure that different pointers do not refer to the same memory, or vectorization might change program behavior.
Example that may confuse a compiler:

```c
void add_arrays(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

If `a` can overlap with `b` or `c`, vectorizing may be unsafe. In C, the `restrict` qualifier (or compiler-specific variants such as `__restrict__` in C++) can help:

```c
void add_arrays(float * restrict a,
                const float * restrict b,
                const float * restrict c,
                int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

This tells the compiler there is no aliasing between these pointers.
Predictable Control Flow
Branches inside vectorized loops reduce SIMD efficiency, because different elements may “want” different paths.
Better:
- Remove branches by precomputing masks
- Use conditional assignments (`select`/blend) instead of `if` when possible
- Separate different cases into different loops
Example:
```c
for (int i = 0; i < n; i++) {
    if (x[i] > 0.0f) {
        y[i] = x[i];
    } else {
        y[i] = 0.0f;
    }
}
```

Vectorized version (conceptually):
- Compute a mask `m[i] = (x[i] > 0)`
- Select `y[i] = m[i] ? x[i] : 0.0f`
Modern SIMD instruction sets support masked operations to implement this.
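In plain C, the same idea can often be expressed with a conditional expression, which compilers commonly lower to a compare-and-blend rather than a branch (a sketch; the exact code generated depends on the compiler):

```c
for (int i = 0; i < n; i++) {
    // branch-free select: keep positive values, zero out the rest
    y[i] = (x[i] > 0.0f) ? x[i] : 0.0f;
}
```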
Regular, Contiguous Memory Access
SIMD is most efficient when accessing data that is:
- Contiguous in memory (unit stride)
- Properly aligned
- Stored in simple arrays, not scattered in complex structures
Array-of-Structures (AoS) is often less SIMD-friendly:
```c
typedef struct {
    float x, y, z;
} Point;

Point p[n];
for (int i = 0; i < n; i++) {
    p[i].x = p[i].x * 2.0f;
}
```

Structure-of-Arrays (SoA) is more SIMD-friendly:

```c
typedef struct {
    float *x;
    float *y;
    float *z;
} Points;

Points p;
for (int i = 0; i < n; i++) {
    p.x[i] = p.x[i] * 2.0f; // contiguous array
}
```

In HPC, data layout is often changed to SoA or mixed layouts to enable better vectorization.
Memory Alignment and SIMD
Many SIMD instruction sets have:
- Aligned load/store instructions: assume data starts at an address aligned to the vector width (e.g., 32-byte boundary for 256-bit vectors)
- Unaligned load/store instructions: handle arbitrary addresses but may be slower (depending on architecture)
For best performance:
- Align arrays to appropriate boundaries (e.g., 32 or 64 bytes)
- Use allocators or language features that support alignment (e.g., an aligned allocator such as C11's `aligned_alloc` or POSIX `posix_memalign`, compiler extensions, or alignment attributes)
If alignment is unknown, compilers may:
- Use slower unaligned instructions
- Emit prologue code to “peel” loop iterations until data becomes aligned, then use aligned loads
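A minimal sketch of aligned allocation using C11's `aligned_alloc` (the 64-byte alignment is illustrative; C11 requires the requested size to be a multiple of the alignment, hence the round-up):

```c
#include <stdlib.h>

// Allocate n floats on a 64-byte boundary (suits 512-bit vectors and
// typical cache lines). Release the buffer with free().
float *alloc_aligned_floats(size_t n) {
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 63) & ~(size_t)63;  // round up to a multiple of 64
    return aligned_alloc(64, rounded);
}
```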
Auto-Vectorization by Compilers
Modern compilers (GCC, Clang/LLVM, Intel, etc.) attempt auto-vectorization when optimization is enabled (e.g., -O2, -O3).
Typical behavior:
- Analyze loops
- Check for dependencies, aliasing, and control flow
- Decide whether vectorization is safe and profitable
- Generate SIMD instructions if possible
As a user, you can:
- Inspect compiler reports that show vectorization decisions (e.g., GCC's `-fopt-info-vec` options, Clang's `-Rpass=loop-vectorize`, or older flags such as `-ftree-vectorizer-verbose`)
- Slightly modify loops (or data structures) to make vectorization easier
- Use directives (e.g., `#pragma omp simd`, `#pragma ivdep`) to assert safety when the compiler cannot prove it, but you know it is safe
Understanding these reports is important in HPC when squeezing performance out of inner loops.
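As a concrete sketch of checking those reports (the flag spellings below are the common GCC and Clang ones and may vary between compiler versions):

```c
// Compile with vectorization reports enabled, for example:
//   gcc   -O3 -c -fopt-info-vec-optimized scale.c
//   clang -O3 -c -Rpass=loop-vectorize    scale.c
// The restrict qualifiers remove the aliasing doubt that could
// otherwise block vectorization of this loop.
void scale(float * restrict out, const float * restrict in, float s, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = s * in[i];
    }
}
```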
Intrinsics and Low-Level Control
When auto-vectorization is insufficient, some HPC codes:
- Use intrinsics: C/C++ functions that map directly to SIMD instructions (e.g., `_mm256_add_ps` for AVX)
Example (conceptual, AVX for float addition):
```c
#include <immintrin.h>

void add_avx(float *a, float *b, float *c, int n) {
    int i;
    // main vector loop: 8 floats per iteration
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
    // handle remaining elements (n % 8)
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

This gives full control over:
- Which SIMD instructions are used
- Alignment decisions
- Masking and special cases
However, it ties code to a specific instruction set and is more complex to maintain, so it’s used selectively in performance-critical kernels.
Vectorization and Floating-Point Behavior
Vectorization can change:
- The order of operations
- How rounding errors accumulate
Because floating-point arithmetic is not strictly associative:
$$
(a + b) + c \neq a + (b + c)
$$
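A small single-precision illustration (the values are chosen so the two orderings give visibly different results):

```c
#include <stdio.h>

int main(void) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   // a + b is exactly 0, so the result is 1.0f
    float right = a + (b + c);   // b + c rounds back to -1.0e8f, so the result is 0.0f
    printf("left = %g, right = %g\n", left, right);
    return 0;
}
```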
Implications:
- Results may differ slightly between scalar and vectorized versions
- Reductions (sum, dot product) are especially sensitive
In HPC, it is common to:
- Accept small, explainable differences within a numerical tolerance
- Or use stricter options if bitwise reproducibility is needed (often at some performance cost)
Vectorization in the Wider HPC Context
Within an HPC node:
- Each core uses SIMD/vectorization to exploit data-level parallelism
- Many cores on the CPU exploit thread/process-level parallelism
- Nodes exploit cluster-level parallelism via message passing
Vectorization is therefore:
- The lowest-level, per-core performance lever
- A prerequisite for approaching the advertised peak FLOPs of a CPU
Effective HPC applications:
- Are written (or structured) with vectorization in mind
- Use numerical libraries that are heavily hand-vectorized
- Pay close attention to data layout and loop structure in performance-critical sections