What SIMD Means in Practice
SIMD stands for Single Instruction, Multiple Data. It describes hardware that applies the same operation to many data elements in parallel.
At the lowest level, a SIMD unit operates on vectors of data:
- A scalar operation `c = a + b` adds one pair of numbers.
- A SIMD vector operation `C = A + B` adds multiple pairs of numbers in one instruction.
- If the SIMD width is 4 for `float`, a single instruction computes:
$$
C_0 = A_0 + B_0, \quad
C_1 = A_1 + B_1, \quad
C_2 = A_2 + B_2, \quad
C_3 = A_3 + B_3
$$
On modern CPUs this is implemented by:
- Wide registers (e.g., 128-bit, 256-bit, 512-bit)
- Vector instructions that operate on multiple numbers at once
This is data-level parallelism inside a single core, distinct from multi-core or multi-node parallelism.
SIMD Width and Data Types
The number of elements processed in parallel depends on:
- The SIMD register width (e.g., 128-bit, 256-bit, 512-bit)
- The data type size (e.g., 32-bit floats, 64-bit doubles, 32-bit ints)
Example for a 256-bit SIMD register:
- 256 / 32 = 8 single-precision floats processed at once
- 256 / 64 = 4 double-precision values (doubles) processed at once
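As a small illustration of this arithmetic (a sketch; the widths are just illustrative constants, and the `sizeof` values assume typical platforms):

```c
#include <stdio.h>

int main(void) {
    // Common SIMD register widths in bits (SSE, AVX, AVX-512)
    int widths[] = {128, 256, 512};
    for (int i = 0; i < 3; i++) {
        int w = widths[i];
        printf("%3d-bit register: %d floats or %d doubles per instruction\n",
               w, w / (8 * (int)sizeof(float)), w / (8 * (int)sizeof(double)));
    }
    return 0;
}
```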
Typical desktop/server SIMD extensions:
- SSE: 128-bit
- AVX: 256-bit
- AVX-512: 512-bit
- ARM NEON: usually 128-bit
- ARM SVE: scalable width (e.g., 128–2048 bits, implementation-dependent)
For HPC, wider vectors generally mean higher peak throughput, but also stricter requirements on data layout and alignment.
Vectorization vs SIMD
- SIMD: hardware capability (vector registers + vector instructions)
- Vectorization: transforming code so it uses SIMD hardware
Vectorization can be:
- Automatic (auto-vectorization): the compiler converts loops to use SIMD instructions
- Manual:
- Using intrinsics (functions that map directly to SIMD instructions)
- Using libraries that are already vectorized
- Giving hints/directives to the compiler (e.g., `#pragma omp simd`)
Vectorization is about rewriting or annotating code so the CPU can apply one instruction to many data elements.
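As a small sketch of this on a concrete loop (the pragma assumes OpenMP SIMD support, e.g., compiling with `-fopenmp-simd` or `-fopenmp`; without the pragma, an optimizing compiler at `-O2`/`-O3` may auto-vectorize the loop on its own):

```c
// y[i] = a * x[i] + y[i]  -- a classic SIMD-friendly kernel (saxpy)
void saxpy(float * restrict y, const float * restrict x, float a, int n) {
    #pragma omp simd          // explicit hint: this loop is safe to vectorize
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```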
Typical SIMD-Friendly Patterns
The most SIMD-friendly patterns are:
- Simple loops with independent iterations, for example:

```c
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}
```

Each iteration uses `a[i]`, `b[i]`, and `c[i]` only; iterations do not depend on each other.
Common HPC kernels that often vectorize well:
- Element-wise vector operations: addition, multiply, fused multiply-add
- Dot products and simple reductions (with some extra handling; see the sketch after this list)
- Stencil operations with regular neighbor accesses (if carefully written)
- Matrix-vector multiply and many linear algebra kernels
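For example, the dot-product reduction mentioned above can be written so the compiler knows how to vectorize the accumulation (a sketch, assuming OpenMP SIMD support such as `-fopenmp-simd`):

```c
// Dot product with an explicit SIMD reduction hint.
float dot(const float * restrict a, const float * restrict b, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)   // partial sums per lane, combined at the end
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```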
Speedup Potential and Limits
If a SIMD unit processes $W$ elements per instruction, the ideal speedup from vectorization is up to $W\times$ for the vectorized part of the code.
Example:
- 256-bit SIMD with single-precision floats: $W = 8$
- A loop dominated by arithmetic on arrays of floats may (ideally) run up to 8× faster when fully vectorized.
In practice, actual speedup is limited by:
- Memory bandwidth (can you feed data to SIMD fast enough?)
- Instruction mix (non-vectorizable operations, branches)
- Overheads in handling remainders (loop tails not divisible by $W$)
- Imperfect auto-vectorization
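A rough way to quantify these limits is an Amdahl-style estimate: assume a fraction $f$ of the runtime is vectorized and speeds up by exactly $W$, while the rest is unchanged. The overall speedup is then

$$
S = \frac{1}{(1 - f) + \dfrac{f}{W}}
$$

For example, with $f = 0.9$ and $W = 8$, $S \approx 4.7$, well below the ideal $8\times$.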
Conditions for Successful Vectorization
For a compiler to safely vectorize a loop, several conditions must hold:
No Loop-Carried Dependencies
Loop iterations must be independent; that is, later iterations must not depend on results produced by earlier iterations.
Example (not vectorizable as-is):

```c
for (int i = 1; i < n; i++) {
    a[i] = a[i] + a[i-1]; // depends on previous element
}
```

Example (vectorizable):

```c
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i]; // no cross-iteration dependence
}
```

If there are dependencies, some can be handled with more advanced vectorization techniques, but those are harder and less universal.
No Unsafe Memory Aliasing
The compiler must be sure that different pointers do not refer to the same memory, or vectorization might change program behavior.
Example that may confuse a compiler:

```c
void add_arrays(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

If `a` can overlap with `b` or `c`, vectorizing may be unsafe. In C, the `restrict` qualifier (or compiler-specific variants such as `__restrict__` in C++) can help:

```c
void add_arrays(float * restrict a,
                const float * restrict b,
                const float * restrict c,
                int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

This tells the compiler there is no aliasing between these pointers.
Predictable Control Flow
Branches inside vectorized loops reduce SIMD efficiency, because different elements may “want” different paths.
Better:
- Remove branches by precomputing masks
- Use conditional assignments (`select`/blend) instead of `if` when possible
- Separate different cases into different loops
Example:
```c
for (int i = 0; i < n; i++) {
    if (x[i] > 0.0f) {
        y[i] = x[i];
    } else {
        y[i] = 0.0f;
    }
}
```

Vectorized version (conceptually):
- Compute a mask `m[i] = (x[i] > 0)`
- Select `y[i] = m[i] ? x[i] : 0.0f`
Modern SIMD instruction sets support masked operations to implement this.
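In plain C, the same idea can often be expressed with a conditional expression, which compilers commonly lower to a compare-and-blend rather than a branch (a sketch; the exact code generated depends on the compiler):

```c
for (int i = 0; i < n; i++) {
    // branch-free select: keep positive values, zero out the rest
    y[i] = (x[i] > 0.0f) ? x[i] : 0.0f;
}
```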
Regular, Contiguous Memory Access
SIMD is most efficient when accessing data that is:
- Contiguous in memory (unit stride)
- Properly aligned
- Stored in simple arrays, not scattered in complex structures
Array-of-Structures (AoS) is often less SIMD-friendly:
```c
typedef struct {
    float x, y, z;
} Point;

Point p[n];
for (int i = 0; i < n; i++) {
    p[i].x = p[i].x * 2.0f;
}
```

Structure-of-Arrays (SoA) is more SIMD-friendly:

```c
typedef struct {
    float *x;
    float *y;
    float *z;
} Points;

Points p;
for (int i = 0; i < n; i++) {
    p.x[i] = p.x[i] * 2.0f; // contiguous array
}
```

In HPC, data layout is often changed to SoA or mixed layouts to enable better vectorization.
Memory Alignment and SIMD
Many SIMD instruction sets have:
- Aligned load/store instructions: assume data starts at an address aligned to the vector width (e.g., 32-byte boundary for 256-bit vectors)
- Unaligned load/store instructions: handle arbitrary addresses but may be slower (depending on architecture)
For best performance:
- Align arrays to appropriate boundaries (e.g., 32 or 64 bytes)
- Use allocators or language features that support alignment (e.g., an aligned allocator such as C11's `aligned_alloc` or POSIX `posix_memalign`, compiler extensions, or alignment attributes)
If alignment is unknown, compilers may:
- Use slower unaligned instructions
- Emit prologue code to “peel” loop iterations until data becomes aligned, then use aligned loads
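A minimal sketch of aligned allocation using C11's `aligned_alloc` (the 64-byte alignment is illustrative; C11 requires the requested size to be a multiple of the alignment, hence the round-up):

```c
#include <stdlib.h>

// Allocate n floats on a 64-byte boundary (suits 512-bit vectors and
// typical cache lines). Release the buffer with free().
float *alloc_aligned_floats(size_t n) {
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 63) & ~(size_t)63;  // round up to a multiple of 64
    return aligned_alloc(64, rounded);
}
```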
Auto-Vectorization by Compilers
Modern compilers (GCC, Clang/LLVM, Intel, etc.) attempt auto-vectorization when optimization is enabled (e.g., -O2, -O3).
Typical behavior:
- Analyze loops
- Check for dependencies, aliasing, and control flow
- Decide whether vectorization is safe and profitable
- Generate SIMD instructions if possible
As a user, you can:
- Inspect compiler reports that show vectorization decisions (e.g., GCC's `-fopt-info-vec` options, Clang's `-Rpass=loop-vectorize`, or older flags such as `-ftree-vectorizer-verbose`)
- Slightly modify loops (or data structures) to make vectorization easier
- Use directives (e.g., `#pragma omp simd`, `#pragma ivdep`) to assert safety when the compiler cannot prove it, but you know it is safe
Understanding these reports is important in HPC when squeezing performance out of inner loops.
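As a concrete sketch of checking those reports (the flag spellings below are the common GCC and Clang ones and may vary between compiler versions):

```c
// Compile with vectorization reports enabled, for example:
//   gcc   -O3 -c -fopt-info-vec-optimized scale.c
//   clang -O3 -c -Rpass=loop-vectorize    scale.c
// The restrict qualifiers remove the aliasing doubt that could
// otherwise block vectorization of this loop.
void scale(float * restrict out, const float * restrict in, float s, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = s * in[i];
    }
}
```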
Intrinsics and Low-Level Control
When auto-vectorization is insufficient, some HPC codes:
- Use intrinsics: C/C++ functions that map directly to SIMD instructions (e.g., `_mm256_add_ps` for AVX)
Example (conceptual, AVX for float addition):
```c
#include <immintrin.h>

void add_avx(float *a, float *b, float *c, int n) {
    int i;
    // main vector loop: 8 floats per iteration
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
    // handle remaining elements (n % 8)
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

This gives full control over:
- Which SIMD instructions are used
- Alignment decisions
- Masking and special cases
However, it ties code to a specific instruction set and is more complex to maintain, so it’s used selectively in performance-critical kernels.
Vectorization and Floating-Point Behavior
Vectorization can change:
- The order of operations
- How rounding errors accumulate
Because floating-point arithmetic is not strictly associative:
$$
(a + b) + c \neq a + (b + c)
$$
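A small single-precision illustration (the values are chosen so the two orderings give visibly different results):

```c
#include <stdio.h>

int main(void) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   // a + b is exactly 0, so the result is 1.0f
    float right = a + (b + c);   // b + c rounds back to -1.0e8f, so the result is 0.0f
    printf("left = %g, right = %g\n", left, right);
    return 0;
}
```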
Implications:
- Results may differ slightly between scalar and vectorized versions
- Reductions (sum, dot product) are especially sensitive
In HPC, it is common to:
- Accept small, explainable differences within a numerical tolerance
- Or use stricter options if bitwise reproducibility is needed (often at some performance cost)
Vectorization in the Wider HPC Context
Within an HPC node:
- Each core uses SIMD/vectorization to exploit data-level parallelism
- Many cores on the CPU exploit thread/process-level parallelism
- Nodes exploit cluster-level parallelism via message passing
Vectorization is therefore:
- The lowest-level, per-core performance lever
- A prerequisite for approaching the advertised peak FLOPs of a CPU
Effective HPC applications:
- Are written (or structured) with vectorization in mind
- Use numerical libraries that are heavily hand-vectorized
- Pay close attention to data layout and loop structure in performance-critical sections