
2.5 SIMD and vectorization concepts

What SIMD Means in Practice

SIMD stands for Single Instruction, Multiple Data. It describes hardware that applies the same operation to many data elements in parallel.

At the lowest level, a SIMD unit operates on vectors of data: wide registers hold several elements side by side, and one instruction is applied to every element at once.

On modern CPUs this is implemented by:

- vector registers (128, 256, or 512 bits wide),
- vector instructions that load, store, add, or multiply whole registers,
- one or more SIMD execution units per core.

This is data-level parallelism inside a single core, distinct from multi-core or multi-node parallelism.
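As a small illustration (a sketch using the GCC/Clang vector_size extension; the text does not name a specific mechanism), a 4-lane vector type lets one addition produce four sums at once:

#include <stddef.h>

/* GCC/Clang extension: a vector of 4 floats (16 bytes). */
typedef float v4f __attribute__((vector_size(16)));

void add4(v4f *c, const v4f *a, const v4f *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* one vector add = 4 scalar adds */
}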

SIMD Width and Data Types

The number of elements processed in parallel depends on:

- the width of the SIMD registers (in bits),
- the size of each data element.

Example for a 256-bit SIMD register:

- 8 × 32-bit float elements,
- 4 × 64-bit double elements,
- 8 × 32-bit integer elements.

Typical desktop/server SIMD extensions:

- SSE: 128-bit registers,
- AVX/AVX2: 256-bit registers,
- AVX-512: 512-bit registers.

For HPC, wider vectors generally mean higher peak throughput, but also stricter requirements on data layout and alignment.

Vectorization vs SIMD

Vectorization can be:

- automatic: the compiler vectorizes loops on its own (auto-vectorization),
- guided: the programmer adds hints such as pragmas, restrict, or alignment information,
- manual: the programmer writes intrinsics or assembly.

Vectorization is about rewriting or annotating code so the CPU can apply one instruction to many data elements.

Typical SIMD-Friendly Patterns

The most SIMD-friendly patterns are simple loops that apply the same operation element-wise to arrays:

  for (int i = 0; i < n; i++) {
      c[i] = a[i] + b[i];
  }

Each iteration uses a[i], b[i], c[i] only; iterations do not depend on each other.

Common HPC kernels that often vectorize well:

- element-wise array operations (addition, scaling, axpy-style updates),
- dot products and other reductions,
- stencil updates on regular grids,
- the inner loops of dense linear algebra.

Speedup Potential and Limits

If a SIMD unit processes $W$ elements per instruction, the ideal speedup from vectorization is up to $W\times$ for the vectorized part of the code.

Example: with 256-bit AVX registers and single-precision (32-bit) floats, $W = 8$, so the vectorized part can ideally run up to $8\times$ faster than its scalar version.

In practice, actual speedup is limited by:

- memory bandwidth (many loops are memory-bound rather than compute-bound),
- the fraction of the code that does not vectorize,
- loop remainders, misalignment, and gather/scatter accesses,
- branches and masked-out lanes.
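To make the non-vectorized fraction concrete (a standard Amdahl-style bound, not stated in the text): if a fraction $f$ of the runtime vectorizes with width $W$, the overall speedup is at most

$$
S = \frac{1}{(1 - f) + f / W}
$$

For example, $f = 0.8$ and $W = 8$ give $S \approx 3.3$, well below the ideal $8\times$.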

Conditions for Successful Vectorization

For a compiler to safely vectorize a loop, several conditions must hold:

No Loop-Carried Dependencies

Loop iterations must be independent; that is, later iterations must not depend on results produced by earlier iterations.

Example (not vectorizable as-is):

for (int i = 1; i < n; i++) {
    a[i] = a[i] + a[i-1];  // depends on previous element
}

Example (vectorizable):

for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];  // no cross-iteration dependence
}

If there are dependencies, some can be handled with more advanced vectorization techniques, but those are harder and less universal.

No Unsafe Memory Aliasing

The compiler must be sure that different pointers do not refer to the same memory, or vectorization might change program behavior.

Example that may confuse a compiler:

void add_arrays(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

If a overlaps b or c, vectorizing may be unsafe. In C (since C99), the restrict qualifier can help; C++ compilers offer __restrict as an extension:

void add_arrays(float * restrict a,
                const float * restrict b,
                const float * restrict c,
                int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

This tells the compiler there is no aliasing between these pointers.

Predictable Control Flow

Branches inside vectorized loops reduce SIMD efficiency, because different elements may “want” different paths.

Better:

- replace branches with arithmetic or per-element select operations (branchless code),
- hoist conditions out of the inner loop where possible.

Example:

for (int i = 0; i < n; i++) {
    if (x[i] > 0.0f) {
        y[i] = x[i];
    } else {
        y[i] = 0.0f;
    }
}

Vectorized version (conceptually): evaluate the comparison x[i] > 0.0f for all lanes at once to produce a per-lane mask, then select between x[i] and 0.0f under that mask, with no branch.

Modern SIMD instruction sets support masked operations to implement this.
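For example (a sketch using AVX intrinsics; the function name is ours), the branch becomes a compare that produces a per-lane mask, followed by a blend:

#include <immintrin.h>

void select_positive(const float *x, float *y, int n) {
    const __m256 vzero = _mm256_setzero_ps();
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 vx   = _mm256_loadu_ps(&x[i]);
        __m256 mask = _mm256_cmp_ps(vx, vzero, _CMP_GT_OQ); /* lanes where x > 0 */
        __m256 vy   = _mm256_blendv_ps(vzero, vx, mask);    /* pick x or 0.0f    */
        _mm256_storeu_ps(&y[i], vy);
    }
    for (; i < n; i++)                                      /* scalar remainder  */
        y[i] = (x[i] > 0.0f) ? x[i] : 0.0f;
}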

Regular, Contiguous Memory Access

SIMD is most efficient when accessing data that is:

- contiguous in memory (unit stride),
- accessed in a regular, predictable pattern,
- aligned to the vector width.

Array-of-Structures (AoS) is often less SIMD-friendly:

typedef struct {
    float x, y, z;
} Point;
Point p[n];
for (int i = 0; i < n; i++) {
    p[i].x = p[i].x * 2.0f;
}

Structure-of-Arrays (SoA) is more SIMD-friendly:

typedef struct {
    float *x;
    float *y;
    float *z;
} Points;
Points p;
for (int i = 0; i < n; i++) {
    p.x[i] = p.x[i] * 2.0f;   // contiguous array
}

In HPC, data layout is often changed to SoA or mixed layouts to enable better vectorization.

Memory Alignment and SIMD

Many SIMD instruction sets have:

- aligned load/store instructions that require addresses to be a multiple of the vector width,
- unaligned variants that work at any address, historically at lower performance.

For best performance:

- align arrays to the SIMD register width (e.g., 32 bytes for AVX, 64 bytes for AVX-512),
- use aligned allocation and, where supported, tell the compiler about the alignment.

If alignment is unknown, compilers may:

- peel a few scalar iterations until an aligned boundary is reached,
- generate multiple loop versions selected by a runtime check,
- fall back to unaligned loads and stores.
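As a sketch of aligned allocation (using C11 aligned_alloc; the 32-byte figure assumes 256-bit AVX):

#include <stdlib.h>

/* Allocate an n-element float array aligned for 256-bit AVX loads.
   C11 aligned_alloc requires the size to be a multiple of the
   alignment, so round the byte count up to the next multiple of 32. */
float *alloc_avx_floats(size_t n) {
    size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return aligned_alloc(32, bytes);   /* release with free() */
}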

Auto-Vectorization by Compilers

Modern compilers (GCC, Clang/LLVM, Intel, etc.) attempt auto-vectorization when optimization is enabled (e.g., -O2, -O3).

Typical behavior:

- simple, dependence-free loops over arrays are vectorized automatically,
- possible aliasing, complex control flow, or function calls in the loop body can prevent it,
- the compiler can report which loops were vectorized and why others were not.

As a user, you can:

- request vectorization reports (e.g., -fopt-info-vec with GCC, -Rpass=loop-vectorize with Clang),
- add restrict, alignment information, or #pragma omp simd hints,
- restructure loops and data so the compiler can prove vectorization is safe.

Understanding these reports is important in HPC when squeezing performance out of inner loops.
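As an example of such a hint (a sketch; compile with -fopenmp-simd or -fopenmp), #pragma omp simd asserts that a loop is safe to vectorize:

void scale(float * restrict y, const float * restrict x,
           float s, int n) {
    /* The pragma asserts vectorization safety;
       restrict rules out aliasing between x and y. */
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        y[i] = s * x[i];
    }
}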

Intrinsics and Low-Level Control

When auto-vectorization is insufficient, some HPC codes:

- use compiler intrinsics that map almost one-to-one to SIMD instructions,
- use SIMD wrapper libraries or, in rare cases, hand-written assembly.

Example (conceptual, AVX for float addition):

#include <immintrin.h>
void add_avx(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i <= n-8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
    // handle remaining elements (n % 8)
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
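Note that the example assumes an AVX-capable CPU; with GCC or Clang, compile with -mavx (or -march=native on the target machine).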

This gives full control over:

- exactly which instructions are used,
- how data moves between memory and registers,
- how the loop remainder is handled.

However, it ties code to a specific instruction set and is more complex to maintain, so it’s used selectively in performance-critical kernels.

Vectorization and Floating-Point Behavior

Vectorization can change:

- the order in which floating-point operations are performed,
- and therefore the rounding of intermediate results.

Because floating-point arithmetic is not strictly associative:
$$
(a + b) + c \neq a + (b + c)
$$

Implications:

- vectorized and scalar builds of the same code can produce slightly different results,
- reductions such as sums and dot products are especially sensitive, because partial results are combined in a different order.

In HPC, it is common to:

- accept small, bounded differences between scalar and vectorized results,
- validate with tolerances instead of bit-for-bit comparison,
- restrict value-changing optimizations (e.g., strict floating-point compiler modes) when reproducibility matters.
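As an illustration of the reordering (a sketch, not from the text), a vectorized-style sum keeps several independent partial sums, so the additions associate differently than a plain left-to-right scalar loop:

float sum_partial(const float *a, int n) {
    float p[8] = {0.0f};            /* 8 independent accumulators, one per lane */
    int i;
    for (i = 0; i + 8 <= n; i += 8)
        for (int k = 0; k < 8; k++)
            p[k] += a[i + k];       /* lane-wise, like one vector add */
    float s = 0.0f;
    for (int k = 0; k < 8; k++)
        s += p[k];                  /* combine partials: a different order */
    for (; i < n; i++)
        s += a[i];                  /* scalar remainder */
    return s;
}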

Vectorization in the Wider HPC Context

Within an HPC node:

- SIMD provides data parallelism inside each core,
- multiple cores provide thread-level parallelism,
- and multiple nodes provide distributed parallelism across the cluster.

Vectorization is therefore:

- the innermost level of the parallelism hierarchy,
- often a prerequisite for reaching a meaningful fraction of a node's peak FLOP rate.

Effective HPC applications:

- combine vectorization with multithreading (e.g., OpenMP) and message passing (e.g., MPI),
- choose data layouts and loop structures that serve all of these levels at once.
