
SIMD and vectorization concepts

Introduction

Single Instruction Multiple Data, usually abbreviated as SIMD, is one of the most important concepts for performance on modern CPUs and many accelerators. It describes how one instruction can operate on several data elements at once. For high performance computing, understanding SIMD and basic vectorization concepts is essential, because modern compilers and libraries rely heavily on it to approach the peak performance of the hardware.

This chapter focuses on what SIMD is, how it differs from ordinary scalar execution, how vector instructions are organized in real processors, and what a beginner needs to know to write code that can be efficiently vectorized.

From Scalar Execution to SIMD

On a purely scalar processor, each instruction operates on a single data element. For example, an instruction like ADD R1, R2, R3 might compute R1 = R2 + R3, where each register holds one number. If you want to add two arrays element by element, you need a loop that repeats this scalar instruction many times.

SIMD takes a different approach. Instead of applying an addition to a single pair of numbers, a SIMD instruction applies the same operation to multiple elements that are packed together in special vector registers. Conceptually, you can think of a SIMD add instruction that operates on small arrays of numbers in one shot.

A scalar add takes something like:

$$ c = a + b $$

A SIMD add can perform, for example, four single precision additions at once:

$$ \mathbf{c} = \mathbf{a} + \mathbf{b} $$

where each bold symbol represents a vector of 4 elements. The same instruction is issued once, but several arithmetic operations happen in parallel inside the CPU.
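For instance, with concrete numbers, a single 4-wide add produces all four sums in one operation:

$$ (1, 2, 3, 4) + (10, 20, 30, 40) = (11, 22, 33, 44) $$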

SIMD in the Context of the Hardware

Modern CPUs contain vector units in addition to their scalar execution units. These vector units operate on wide registers, for example 128 bits, 256 bits, or 512 bits at a time. The exact width depends on the instruction set extension, such as SSE, AVX, or AVX-512 on x86 processors, or NEON and SVE on ARM processors.

A 256 bit vector register can hold, for example, 4 double precision numbers of 64 bits each, or 8 single precision numbers of 32 bits each. A single SIMD instruction can then add or multiply those 4 or 8 values in one cycle, if all goes well, instead of needing 4 or 8 separate scalar instructions.

The clock speed of the CPU does not change when SIMD is used. Instead, SIMD increases the amount of useful work done per clock cycle by packing more data into each instruction.

Vector Registers and Data Types

SIMD operates on vector registers. These are not the usual scalar registers used to hold single values, but wide registers that hold multiple elements of the same type. A particular vector register might be able to represent:

  1. Several 32 bit floating point numbers.
  2. Several 64 bit floating point numbers.
  3. Several 8, 16, or 32 bit integer values.

The layout is always contiguous. For example, with a 256 bit vector register and 64 bit double precision numbers, the register can hold 4 values:

$$ \mathbf{a} = (a_0, a_1, a_2, a_3) $$

An addition of two such vector registers

$$ \mathbf{c} = \mathbf{a} + \mathbf{b} $$

means

$$ c_i = a_i + b_i \quad \text{for} \quad i = 0,\dots,3 $$

The important point is that the same instruction operates on all these elements at once, and all these elements must be of the same type and arranged in a way that the hardware expects.

Basic SIMD Operations

Although vector instruction sets are large and sometimes complex, the core operations are conceptually simple and similar across architectures.

The most common categories are:

  1. Load and store operations, which move data between memory and vector registers.
  2. Arithmetic operations, such as addition, subtraction, multiplication, and sometimes fused multiply add (FMA). An FMA computes $a \times b + c$ in a single instruction and often with a single rounding step.
  3. Logical operations, typically bitwise AND, OR, XOR, and NOT, which work on integer or bit patterns in the vector.
  4. Comparison operations, such as equal, less than, or greater than, which produce masks describing which elements satisfy the condition.
  5. Permute and shuffle operations, which rearrange elements inside or across vector registers.
  6. Reduction operations, like computing the sum or maximum of all elements in a vector register.

In most real codes, the performance critical parts boil down to loops that repeatedly perform arithmetic on arrays. Compilers translate those loops into patterns of vector loads, arithmetic on vectors, and vector stores.
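As an illustration of this pattern, consider an AXPY-style kernel (the function name and signature below are only for this sketch). When the compiler vectorizes the loop, each group of iterations becomes vector loads of x and y, one vector fused multiply-add, and a vector store back into y:

/* AXPY-style kernel: y = alpha * x + y.
   A vectorizing compiler turns groups of iterations into
   vector loads, one vector FMA, and a vector store. */
void axpy(int n, double alpha, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}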

Vectorization by the Compiler

Programmers writing in typical high-level languages like C, C++, or Fortran almost never spell out vector instructions by hand. Instead, the compiler analyzes loops and attempts to convert scalar code into vector code. This automatic transformation is called auto-vectorization.

The compiler has to prove that it is safe to treat several loop iterations as independent operations that can be grouped into a single SIMD instruction. If the compiler is unsure about data dependencies or pointer aliasing, it may refuse to vectorize a loop, even if a human could see that it is safe.

A simple example of a vectorizable loop in C is:

for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

Here, each iteration uses different elements of the arrays and does not depend on the result of any other iteration. Compilers are usually able to detect this pattern and generate a sequence of SIMD operations, each operating on multiple elements of a, b, and c in one instruction.

In contrast, a loop where each iteration depends on the previous result, such as a simple prefix sum, is much harder to vectorize automatically, because the operations are not independent.
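Such a prefix sum looks like this; every iteration needs the result of the previous one, so the iterations cannot simply be grouped into one vector operation:

/* Prefix sum: s[i] depends on s[i-1], so iterations are not independent
   and the loop does not vectorize in a straightforward way. */
s[0] = a[0];
for (int i = 1; i < n; i++) {
    s[i] = s[i - 1] + a[i];
}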

Auto-vectorization only works when loop iterations are independent, memory accesses are predictable and regular, and the compiler can be sure that there are no conflicting data dependencies.

Alignment and Memory Access Patterns

Vector loads and stores are most efficient when data in memory is aligned to boundaries that match the vector width. For example, a 256 bit load instruction may work best if the address is a multiple of 32 bytes. Misaligned loads and stores can sometimes be slower, because the hardware might need to perform extra work internally.

From a programming point of view, the most important rule is to store numerical data in contiguous arrays, and to access them with simple patterns such as a[i], b[i], and c[i]. Strided access, where only every k-th element is used, or random access through pointers or indices, is typically more expensive and can prevent vectorization.

Compilers like simple, predictable memory patterns. The classic example is the inner loop of a matrix operation that walks linearly through memory. Complex indexing inside loops, pointer arithmetic that is hard to reason about, or aliasing between different pointers can all inhibit vectorization.
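As a small sketch, C11 provides aligned_alloc, which can be used to request arrays whose starting address matches a vector-friendly boundary (the helper function below is only illustrative; 32 bytes corresponds to a 256 bit vector):

#include <stdlib.h>

/* Allocate n doubles starting at a 32-byte boundary (256 bits).
   The size is rounded up to a multiple of the alignment, as C11
   aligned_alloc expects. */
double *make_aligned_array(size_t n) {
    size_t bytes = ((n * sizeof(double) + 31) / 32) * 32;
    return aligned_alloc(32, bytes);
}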

Masks and Conditional Operations

Real code often contains conditional statements inside loops. A naive implementation might branch for each element, but that does not map directly to SIMD, because all elements in a vector share the same instruction.

SIMD architectures use masking to represent which elements in a vector are active for an operation. A mask is a set of bits, one per element, that indicates where the instruction should apply. For example, a comparison a[i] > 0 for a vector a might produce a mask

$$ \mathbf{m} = (m_0, m_1, \dots, m_{k-1}) $$

where $m_i$ is true when $a_i > 0$ and false otherwise. Subsequent vector operations can use this mask to update only the elements where the condition holds.

In high-level code, this often appears as conditional assignments or ternary operators, which compilers can translate into masked vector instructions when possible. Complex control flow with early exits or data-dependent loop behavior remains difficult to vectorize.
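For example, a branch-free conditional assignment like the following can typically be compiled into a vector comparison that produces a mask, followed by a masked or blended vector update:

/* Keep positive values, replace the rest with zero.
   Maps naturally onto a vector compare plus a masked select. */
for (int i = 0; i < n; i++) {
    c[i] = (a[i] > 0.0) ? a[i] : 0.0;
}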

Reductions and Vectorization

Reductions are operations that combine many elements into a single result. Examples include sums, products, minima, maxima, and norms. These patterns appear frequently in numerical algorithms and scientific codes.

At first sight, a reduction seems to contain dependencies, because each step accumulates the result from previous steps. However, many reductions are associative and commutative in exact arithmetic. This allows their execution to be rearranged.

SIMD implementations of reductions use tree-like patterns to combine elements in parallel. A vector sum, for example, might start by summing pairs of elements inside the vector, then pairs of the results, and so on, until a single value remains. In source code, this is usually hidden. The compiler recognizes common reduction idioms, such as:

double sum = 0.0;
for (int i = 0; i < n; i++) {
    sum += a[i];
}

and generates vectorized reduction code. However, not all compilers can recognize all patterns automatically, and the programmer sometimes has to help by writing the code in a style the compiler can understand, or by using hints such as pragmas.
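One widely used hint is the OpenMP simd directive, which explicitly tells the compiler that the loop is a sum reduction it may vectorize (a sketch; it requires a compiler with OpenMP SIMD support, enabled for example with -fopenmp-simd in GCC and Clang):

/* Explicitly marks the loop as a vectorizable sum reduction. */
double sum = 0.0;
#pragma omp simd reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += a[i];
}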

One consequence of vectorized reductions is that the order of floating point operations changes. Due to finite precision, this can lead to slightly different results compared to a purely scalar evaluation. In many HPC contexts, these small differences are acceptable and are considered a normal consequence of numerical computation.

Intrinsics and Explicit Vectorization

While auto-vectorization is convenient, it is not always sufficient. Some applications require more control over the generated SIMD instructions. For those cases, modern compilers provide intrinsics, which are functions that correspond almost directly to individual vector instructions.

For example, an intrinsic might expose a specific vector add operation as a function like:

__m256d _mm256_add_pd(__m256d a, __m256d b);

where __m256d is a type that represents a 256 bit vector of double precision numbers. Calling this intrinsic in C or C++ will instruct the compiler to emit the specific vector addition instruction. Similar intrinsics exist for loads, stores, shuffles, and many other operations.
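As a small sketch, an array addition written directly with AVX intrinsics might look like the following (the function name is illustrative; unaligned loads and stores are used, and n is assumed to be a multiple of 4):

#include <immintrin.h>

/* Add two arrays of doubles four elements at a time with AVX intrinsics.
   Assumes n is a multiple of 4; real code also needs a scalar tail loop. */
void add_avx(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);  /* load 4 doubles from a */
        __m256d vb = _mm256_loadu_pd(&b[i]);  /* load 4 doubles from b */
        __m256d vc = _mm256_add_pd(va, vb);   /* 4 additions in one instruction */
        _mm256_storeu_pd(&c[i], vc);          /* store 4 results into c */
    }
}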

Using intrinsics provides more predictable vectorization, but it tightly couples the code to a specific instruction set, such as AVX or NEON, and makes the program less portable and more complex. For beginners, understanding that intrinsics exist is useful, but writing large codes with them is usually not necessary at the start.

Vectorization-Friendly Programming Practices

Even without writing intrinsics, programmers can help the compiler to generate better SIMD code by following a few practical rules. These are not full performance tuning strategies, but patterns that make vectorization more likely and more effective.

Keep most heavy numeric work inside simple loops that operate on contiguous arrays of basic numeric types. Avoid mixing complex control flow with performance critical arithmetic. Separate the computations into a core loop that is as regular and simple as possible, and move exceptional cases out of that loop.

Write loops with a clear structure. Use straightforward for loops where the bounds and increments are known to the compiler. Avoid modifying loop indices inside the body, and avoid hidden data dependencies.

Reduce pointer aliasing. If the compiler cannot be sure that a[i] and b[i] refer to different memory locations, it may refuse to vectorize. In C, the restrict qualifier is the standard way to promise the compiler that pointers do not alias, as in the sketch below; compiler-specific hints exist as well, and using separate arrays instead of overlapping structures can also help.
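A minimal sketch of a function that uses restrict (a C99 feature) to promise that the input and output arrays do not overlap:

/* The restrict qualifiers promise that a, b, and c do not overlap,
   which removes one common obstacle to vectorization. */
void add_arrays(int n, const double *restrict a,
                const double *restrict b, double *restrict c) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}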

Understand that operations like divisions, square roots, and transcendental functions are often more expensive than additions and multiplications, and may vectorize differently or with lower throughput. Many math libraries provide vectorized implementations of functions like exp, sin, and cos, which can be used to gain SIMD speedups even in more complex code.
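For example, a simple loop over exp like the one below can, with a suitable compiler, suitable flags, and a vectorized math library available, be compiled to call a SIMD version of exp that evaluates several elements per call; whether this actually happens depends entirely on the toolchain:

#include <math.h>

/* With a vectorized math library available, compilers may replace the
   scalar exp calls in this loop with SIMD implementations. */
for (int i = 0; i < n; i++) {
    y[i] = exp(x[i]);
}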

To benefit from SIMD, numerical kernels should use regular loops over contiguous data, avoid unnecessary data dependencies and aliasing, and keep control flow as simple as possible inside performance critical loops.

SIMD Speedup and Limits

If a vector register holds $W$ elements of a given data type, SIMD can in theory speed up arithmetic operations by up to a factor of $W$ compared to scalar code. For example, a 256 bit vector register with 32 bit floats holds 8 elements, which suggests a potential speedup of up to 8 times for well vectorized operations.

However, real applications rarely reach this ideal limit. Several factors constrain the actual speedup.

First, memory bandwidth may limit how quickly data can be loaded and stored. If the computation per element is low, then memory access cost dominates and vector arithmetic units may be underused.

Second, not all parts of the code can be vectorized. Only certain loops with the right structure will benefit. Other parts remain scalar and reduce the overall speedup.

Third, overhead for handling leftover elements when the array length is not exactly divisible by the vector width, and for dealing with misaligned or irregular data, can reduce the effective gain.
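A typical way this shows up is a main loop that processes full vector-width groups, followed by a scalar remainder loop for the last few elements (a conceptual sketch for a 4-wide vector):

/* Main loop handles full groups of 4 elements; the remainder loop
   handles the last n % 4 elements one at a time. */
int i;
for (i = 0; i + 3 < n; i += 4) {
    c[i + 0] = a[i + 0] + b[i + 0];
    c[i + 1] = a[i + 1] + b[i + 1];
    c[i + 2] = a[i + 2] + b[i + 2];
    c[i + 3] = a[i + 3] + b[i + 3];
}
for (; i < n; i++) {
    c[i] = a[i] + b[i];
}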

Understanding these limitations helps set realistic expectations. SIMD is a powerful tool that can provide significant speedups for the right kind of numerical loop, but it is not a magic solution for every algorithm.

SIMD Across Architectures

Although the broad SIMD idea is consistent, each architecture has its own style of vector instructions and its own width of vector registers. For example, x86 CPUs evolved from early 64 bit vector units to 128 bit SSE, then 256 bit AVX, and now 512 bit AVX-512. ARM-based CPUs provide NEON and the more recent SVE architecture, which allows flexible vector widths.

Programming models and compilers aim to shield most users from these differences. High level code written in C, C++, or Fortran can remain portable, while the compiler generates different vector instructions depending on the target machine. Libraries that are tuned for specific architectures also use vectorization internally, so that applications benefit without needing to know the details.

However, from an HPC perspective, it is important to recognize that different systems may have different effective vector widths, and that portable code which is written in a vectorization-friendly style allows compilers and libraries to exploit these differences as effectively as possible.

Summary

SIMD and vectorization are about performing the same operation on multiple data elements with a single instruction, using special vector hardware units inside modern processors. In high performance computing, this is a central mechanism for increasing throughput per core.

Compilers attempt to automatically translate scalar loops into vector instructions when they can be sure that the transformation is safe. The ability to do so depends on data independence, simple and regular memory access patterns, limited and manageable control flow, and well understood reduction patterns.

Programmers influence vectorization mainly by the way they structure their code and organize data. Keeping numerical kernels simple, regular, and focused, and storing data in contiguous arrays, creates opportunities for SIMD. Although the theoretical speedup is bounded by the vector width, practical gains can be large for well designed loops and algorithms.

Understanding these concepts at a conceptual level prepares beginners for later chapters that deal with performance analysis, optimization, and programming models that build on these low level capabilities.
