What Registers Are in the Memory Hierarchy
Registers are the fastest storage locations in a CPU and sit at the very top of the memory hierarchy. They are:
- Located inside the CPU core
- Extremely small in number and capacity (bytes, not kilobytes)
- Directly accessed by almost every instruction
- Invisible as “memory addresses” in normal code; accessed through instructions and compiler choices
In the hierarchy, you can order the levels from fastest and smallest to slowest and largest:
$$\text{Registers} \ll \text{L1 cache} \ll \text{L2/L3 cache} \ll \text{RAM} \ll \text{disk}$$
Accessing a register is effectively “free” compared to any other memory access and does not induce cache misses.
Types of Registers (Conceptual)
Different CPU architectures define different exact register sets, but conceptually you will encounter:
- General-purpose registers (GPRs): used for integer arithmetic, addresses, and general computations.
- Floating-point registers: used for floating-point operations (`float`, `double`).
- Vector/SIMD registers: used for operations on multiple data elements at once (vectorization). These are crucial for HPC performance.
- Special-purpose registers:
  - Program counter / instruction pointer (holds the address of the next instruction)
  - Stack pointer (points to the top of the current stack frame)
  - Status/flags register (holds condition codes like zero, carry, overflow)
From a high-level HPC perspective, the distinction that really matters is:
- Scalar registers (single value)
- Vector registers (many values packed together)
Registers and Instruction Execution
For a typical CPU instruction, operands must be in registers (strictly true on load/store architectures, and a good mental model on any modern CPU):
- Values are loaded from memory (through caches) into registers.
- Instructions perform computations on register values.
- Results may be stored back from registers to memory.
Example (conceptual, not exact assembly):
```
; Load from memory into registers
LOAD  R1, [A]      ; R1 = A
LOAD  R2, [B]      ; R2 = B

; Compute in registers
ADD   R3, R1, R2   ; R3 = R1 + R2

; Store the result back to memory
STORE [C], R3      ; C = R3
```
All arithmetic and logical operations happen in registers; memory is only read/written via load/store instructions.
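In C, the same conceptual sequence looks like the minimal sketch below (the exact instructions depend on the target ISA and optimization level):

```c
/* Conceptual C equivalent of the load/compute/store sequence above.
 * With optimization enabled, x, y, and z typically live in registers;
 * only the reads of *a and *b and the write of *c touch memory. */
void add_one(const double *a, const double *b, double *c)
{
    double x = *a;       /* LOAD  R1, [A]    */
    double y = *b;       /* LOAD  R2, [B]    */
    double z = x + y;    /* ADD   R3, R1, R2 */
    *c = z;              /* STORE [C], R3    */
}
```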
Registers and Compiler Optimization
You don’t usually manipulate registers directly in high-level languages; the compiler decides what lives in registers. However, your coding style and compiler options strongly influence register usage:
- Local variables in tight loops are prime candidates to stay in registers.
- Global variables, pointer aliasing, and complex control flow make it harder for the compiler to keep values in registers.
- Optimization flags (`-O2`, `-O3`, etc.) instruct the compiler to:
  - Allocate more variables in registers
  - Reorder computations to reuse register values efficiently
  - Unroll loops to use registers and pipelines more effectively
When a compiler cannot keep all needed values in registers, it performs a register spill: some values are temporarily stored to memory (typically the stack) and reloaded later. Spilling is much slower than staying entirely in registers.
For HPC, a key performance idea is:
Minimize register spills in hot (performance-critical) code sections.
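One common reason a value cannot stay in a register is possible pointer aliasing. The sketch below (hypothetical function names, assuming a C compiler such as gcc or clang at `-O2`) contrasts accumulating through a pointer, which may force a store and reload each iteration, with accumulating in a local variable that can live in a register:

```c
#include <stddef.h>

/* Harder to keep in a register: *sum could alias a[i], so the compiler
 * is generally forced to write the partial sum back to memory on every
 * iteration. */
void accumulate_via_pointer(const double *a, size_t n, double *sum)
{
    *sum = 0.0;
    for (size_t i = 0; i < n; i++)
        *sum += a[i];
}

/* Register-friendly: the local accumulator s has no aliases, so it can
 * stay in a register for the whole loop and is stored to memory once. */
void accumulate_in_register(const double *a, size_t n, double *sum)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    *sum = s;
}
```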
Register Pressure
Register pressure is the demand for more registers than are physically available at a given point in the code.
High register pressure leads to:
- Spills (extra memory traffic through caches and RAM)
- Longer instruction sequences
- Lower performance, especially in inner loops and vectorized kernels
Factors that increase register pressure:
- Many live variables at once (e.g., large inlined functions, many temporaries)
- Complex expressions or deeply unrolled loops
- Aggressive vectorization using wide SIMD registers
Typical ways to help the compiler reduce register pressure in HPC code:
- Simplify inner loops (fewer live variables at once)
- Split long expressions into smaller steps
- Avoid unnecessary temporaries in critical sections
- Use compiler reports/options to inspect register usage when needed
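As a concrete, hedged sketch of register pressure: x86-64, for example, has 16 general-purpose and 16 or 32 vector registers, so a loop body that keeps many partial sums live at once is likely to spill, while a version with only a few accumulators can keep them all in registers. The function names below are hypothetical, and actual behavior depends on the compiler:

```c
/* Many live values: 32 independent partial sums are live across the whole
 * loop body, which typically exceeds the physical register count and can
 * force spills to the stack. (Remainder handling omitted for brevity.) */
double dot_high_pressure(const double *a, const double *b, int n)
{
    double acc[32] = {0.0};
    for (int i = 0; i + 32 <= n; i += 32)
        for (int k = 0; k < 32; k++)
            acc[k] += a[i + k] * b[i + k];

    double s = 0.0;
    for (int k = 0; k < 32; k++)
        s += acc[k];
    return s;
}

/* Fewer live values: 4 accumulators fit comfortably in registers while
 * still exposing some instruction-level parallelism. */
double dot_low_pressure(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```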
Registers and Vectorization
Modern HPC CPUs have vector/SIMD registers that can hold multiple data elements:
- Example (conceptual): a 256-bit register can hold:
  - 4 double-precision (`double`) values, or
  - 8 single-precision (`float`) values
Vector instructions operate on the entire register at once:
- Add 4 doubles in one instruction
- Multiply 8 floats in one instruction
For a simple loop like:
```c
for (int i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
}
```
The compiler can:
- Load multiple `A[i]` and `B[i]` values into vector registers.
- Perform vector additions using SIMD instructions.
- Store the resulting vector registers back to memory.
Effective use of vector registers is central to HPC performance; how this is exploited is covered more deeply under SIMD/vectorization concepts, but here the key point is that vector registers are just larger, specialized registers that enable parallel operations on data.
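To make the "wider register" idea concrete, here is a minimal sketch of the same loop written with x86 AVX intrinsics (assuming a compiler and CPU with AVX support; other architectures would use NEON, SVE, etc.). Each `__m256d` value corresponds to one 256-bit vector register holding 4 doubles:

```c
#include <immintrin.h>

/* C[i] = A[i] + B[i], processed 4 doubles at a time in 256-bit registers. */
void vec_add(const double *A, const double *B, double *C, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(&A[i]); /* load 4 doubles into a vector register */
        __m256d vb = _mm256_loadu_pd(&B[i]);
        __m256d vc = _mm256_add_pd(va, vb);  /* one instruction adds all 4 lanes */
        _mm256_storeu_pd(&C[i], vc);         /* store the vector register to memory */
    }
    for (; i < n; i++)                       /* scalar tail for leftover elements */
        C[i] = A[i] + B[i];
}
```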
Registers, Function Calls, and the Stack
Function calls influence register usage:
- Some registers are designated as caller-saved: the calling function must save/restore them if it wants their values preserved across the call.
- Others are callee-saved: the called function must save/restore them if it uses them.
This convention:
- Ensures that registers can be used by both caller and callee
- Introduces overhead: saving and restoring registers to/from the stack
In performance-critical HPC kernels:
- Deep call chains in inner loops can cause extra save/restore operations (more memory traffic).
- It is common to:
- Inline small functions in hot loops (letting the compiler manage registers globally)
- Use simple, flat loop structures in core compute routines
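As a hedged illustration of the inlining point (function names are hypothetical; actual inlining decisions depend on the compiler and flags), a tiny helper declared `static inline` in a hot loop lets the compiler allocate registers across the whole loop body instead of saving and restoring registers around every call:

```c
/* Tiny helper used inside a hot loop. As an out-of-line call, the calling
 * convention may force values into caller-/callee-saved registers and onto
 * the stack around each call; once inlined, the compiler manages registers
 * for the whole loop at once. */
static inline double axpy_elem(double alpha, double x, double y)
{
    return alpha * x + y;
}

void axpy(double alpha, const double *x, const double *y, double *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = axpy_elem(alpha, x[i], y[i]); /* typically inlined at -O2/-O3 */
}
```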
Registers and Different Data Types
Different data types often occupy different registers or portions of a register set:
- Integer operations use integer registers.
- Floating-point operations use FP/vector registers.
- Mixed-type expressions may require more instructions to move/convert values between register types.
From an HPC perspective:
- Consistent data types in tight loops (e.g., all `double`) allow the compiler to generate cleaner, register-efficient code.
- Unnecessary type conversions (e.g., repeatedly converting `float` to `double`) increase register usage and instruction count.
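A hedged sketch of the conversion point (hypothetical function names; exact code generation varies by compiler and flags): passing the scale factor as a `double` forces a `float`-to-`double` conversion and back on every element, while keeping everything in `float` avoids the extra instructions and wider registers:

```c
/* Mixed types: x[i] is widened to double, multiplied, and narrowed back
 * to float on every iteration, costing extra instructions and registers. */
void scale_mixed(float *x, int n, double scale)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] * scale;    /* float -> double -> float per element */
}

/* Consistent types: the whole loop stays in float registers/lanes. */
void scale_float(float *x, int n, float scale)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] * scale;    /* no conversions */
}
```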
Practical Signals of Register Issues
Even without reading assembly, you can sometimes infer register-related problems from:
- Compiler diagnostics (e.g., with “verbose” optimization reports) referencing “spilling” or high register usage.
- Performance counters (via profiling tools) showing:
- Unexpectedly high load/store activity
- Lower-than-expected FLOP/instruction ratios for a compute-bound kernel
In small experiments, you might observe:
- A loop becoming slower when you add “unnecessary” local variables or complex temporary expressions, because register pressure increased.
- Improved performance when simplifying the loop body or using higher optimization levels, because the compiler keeps more data in registers.
Summary: Registers in HPC Context
Key points specific to registers in the memory hierarchy:
- Registers are the fastest, smallest storage, directly inside each CPU core.
- All arithmetic/logic happens on data in registers; memory is accessed only via loads/stores.
- HPC performance depends critically on:
- Keeping hot data in registers
- Avoiding register spills
- Efficiently using vector/SIMD registers
- Code structure and compiler options strongly influence how well registers are used.
Understanding how registers fit into the hierarchy helps you reason about why seemingly small code changes or compiler flags can dramatically impact performance in high-performance computing.