2.2.1 Registers

Role of Registers in the Memory Hierarchy

In the memory hierarchy, registers sit at the very top. They are the smallest, fastest storage locations directly inside the CPU. Every arithmetic or logical operation ultimately works on data that has been brought into registers, even if it originally comes from cache, main memory, or storage.

From an HPC perspective, efficient use of registers is crucial because they determine how quickly the CPU can feed its execution units. Whenever data must be fetched from somewhere slower than a register, such as cache or RAM, the processor may have to wait, which reduces performance. Understanding what registers are and how they are used helps make sense of compiler optimizations, vectorization reports, and many performance metrics that you will encounter later in the course.

What Registers Are

Registers are very small storage cells on the CPU chip, built from flip-flops in the processor’s logic circuitry. They hold a fixed number of bits: 64 bits for the scalar general-purpose registers of a modern 64-bit architecture (32 bits on older 32-bit designs), with considerably wider registers available for vector operations.

A program does not directly manage hardware flip-flops. Instead, the processor’s instruction set architecture, or ISA, presents an abstract set of named registers such as rax, rbx, or xmm0 on x86_64, or x0, x1, and v0 on ARM. Machine instructions specify which registers to read from and which to write to.

Several key properties distinguish registers from other levels of memory:

First, capacity is extremely small. A core typically exposes only a few dozen architectural registers to software; even the larger physical register files used internally for renaming hold just a few hundred entries, far less than cache or RAM.

Second, speed is extremely high. Register access latency is usually one CPU cycle. For comparison, cache accesses may take several cycles, and main memory accesses may take hundreds of cycles.

Third, bandwidth is very high. Multiple registers can often be read and written in one cycle, allowing many arithmetic operations to proceed in parallel.

Fourth, registers are private per core. In mainstream CPU designs, each core has its own set of registers that cannot be directly accessed by other cores. This has implications for parallel programming and data sharing.

In short, registers are the CPU’s working area. Everything that happens during computation passes through them.

Types of Registers

For practical programming and performance tuning in HPC, you mainly need to recognize a few broad categories of registers. The exact naming and count differ between architectures, but the functions are similar.

General-Purpose Registers

General-purpose registers, often abbreviated as GPRs, are used to hold integer data, addresses, and occasionally other small values such as loop counters or flags. On a 64-bit architecture, a general-purpose register typically holds a 64-bit value.

When your high-level code manipulates integers, indexes arrays, or computes memory addresses, the compiler maps those values to general-purpose registers whenever possible. For example, the index variable i in a loop is normally kept in a register to allow rapid increment and comparison.

Conceptually, you can think of GPRs as the primary scratchpad for control flow and address arithmetic.
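
As a minimal sketch (the function name is illustrative), consider the summation loop below. With optimization enabled, a compiler will normally keep i, n, and the running sum in general-purpose registers, so the loop touches memory only to read the array elements.

```c
/* Sketch: with optimization enabled (e.g., -O2), i, n, and sum are
 * normally kept in general-purpose registers for the whole loop. */
long sum_array(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
        sum += a[i];  /* only a[i] comes from memory; sum and i stay in registers */
    }
    return sum;
}
```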

Floating-Point and Vector Registers

Floating-point registers are specialized for storing floating-point numbers and for use with floating-point arithmetic instructions. On modern ISAs, floating-point registers are often unified with vector registers. A single register file can hold both scalar floating-point values and vector values. For example, xmm and ymm registers on x86_64 can be used for scalar double precision as well as multiple packed double precision values.

Vector registers are wide registers used for SIMD operations. A 256-bit vector register, for example, can hold four 64-bit doubles or eight 32-bit floats. Wider vector registers allow more elements to be processed in a single instruction, which gives higher peak floating-point throughput.

These registers are central to vectorization. When a loop is vectorized, operations that were originally scalar are grouped together so that one vector instruction can operate on multiple data elements simultaneously, all stored in a vector register.
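
For instance, a loop like the sketch below (function name illustrative) is a natural vectorization candidate: with 256-bit vector registers and double precision, the compiler can group four iterations into a single vector addition.

```c
/* Sketch: an auto-vectorizing compiler (e.g., at -O3) may turn this
 * scalar loop into vector instructions that add four doubles at a
 * time, held in 256-bit vector registers such as ymm on x86_64. */
void add_arrays(double *c, const double *a, const double *b, long n) {
    for (long i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```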

Special-Purpose Registers

Special-purpose registers control or report on the state of the CPU. Some examples include:

Program counter or instruction pointer, which holds the address of the current or next instruction to execute.

Stack pointer, which tracks the top of the current stack frame.

Frame pointer or base pointer, which may help manage function call frames.

Status or flags registers, which hold condition codes such as zero, negative, carry, or overflow that are produced by arithmetic operations.

These registers influence control flow and function calls. You usually do not manipulate them explicitly in high-level languages, but they matter for how the compiler organizes code, calls functions, and returns from them.
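
As a small illustration (hypothetical code; details vary by ISA), the comparison below typically compiles to an arithmetic instruction that sets condition codes in the flags register, followed by a conditional branch that tests them:

```c
/* Sketch: on x86_64, the addition typically compiles to an add
 * instruction that sets condition flags (sign, overflow, ...), and
 * the if statement becomes a conditional jump that reads those flags. */
long clamp_negative(long a, long b) {
    long s = a + b;
    if (s < 0)        /* decided via the flags register */
        return 0;
    return s;
}
```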

Registers and Instruction Execution

Every CPU instruction interacts with registers in some way. A typical arithmetic instruction reads input registers, performs the operation, and writes the result to a destination register. For example, a simplified view of an instruction could be:

$$
\text{add r1, r2, r3} \quad \Rightarrow \quad r1 \leftarrow r2 + r3
$$

Here r2 and r3 are source registers and r1 is the destination register.

Load and store instructions move data between memory and registers. They are the bridge between the slow part of the memory hierarchy and the fast register file. For example:

$$
\text{load r1, [addr]} \quad \Rightarrow \quad r1 \leftarrow \text{Mem}[\text{addr}]
$$

$$
\text{store [addr], r1} \quad \Rightarrow \quad \text{Mem}[\text{addr}] \leftarrow r1
$$

From an HPC perspective, a key point is that arithmetic instructions cannot operate directly on arbitrary memory locations. Even if a machine instruction appears to use memory as an operand, the processor still moves data into internal registers before computation. The actual hardware always uses registers as the immediate source and destination for data.
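
Expressed in C, a single array statement therefore expands into a load/compute/store sequence. The sketch below (illustrative; the exact instructions depend on the ISA and compiler) annotates one such statement with the conceptual instruction pattern from the formulas above:

```c
/* Conceptual pattern, matching the load/store formulas above:
 *   load  r1, [a + 8*i]    ; bring a[i] into a register
 *   load  r2, [b + 8*i]    ; bring b[i] into a register
 *   add   r3, r1, r2       ; arithmetic is register-to-register
 *   store [c + 8*i], r3    ; write the result back to memory
 */
void add_one(double *c, const double *a, const double *b, long i) {
    c[i] = a[i] + b[i];
}
```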

Register Allocation by the Compiler

High-level languages like C, C++, or Fortran do not require you to name registers. Instead, the compiler performs register allocation, which is the process of deciding which program variables are kept in registers at what times.

Because the number of registers is limited, the compiler must share them among many variables, function arguments, loop indices, temporary values, and so on. If there are more live variables than available registers at some point in the code, the compiler must place some of them in memory. This process is called spilling.

Spilling occurs when the compiler saves a register’s contents to memory in order to reuse that register for a different variable and later reloads that data from memory when it is needed again. These extra loads and stores increase memory traffic and can hurt performance, especially in tight loops.

Compilers use sophisticated algorithms and heuristics to minimize spilling, with particular focus on performance-critical regions such as inner loops. For performance-sensitive HPC codes, the pattern of variable usage, the complexity of loop bodies, and how frequently function calls appear inside loops can all influence register pressure and spilling.

Register pressure is the degree to which code demands registers at a given point; when that demand exceeds the available supply, spilling results. Minimizing register pressure inside hot loops is a key objective for high-performance code.
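
The sketch below (hypothetical; real spill thresholds depend on the ISA and compiler) shows the kind of loop body that raises register pressure: many values are live at the same time, and once their number exceeds the register file, some must be spilled to the stack.

```c
/* Sketch: each ti is live across the loop body. With enough such
 * temporaries (imagine many more than shown), the compiler runs out
 * of registers and spills some to the stack, adding loads and stores
 * to every iteration. */
void many_temps(double *out, const double *x, long n) {
    for (long i = 0; i < n; i++) {
        double t0 = x[i] * 1.1, t1 = x[i] * 2.2, t2 = x[i] * 3.3;
        double t3 = x[i] * 4.4, t4 = x[i] * 5.5, t5 = x[i] * 6.6;
        out[i] = t0 + t1 + t2 + t3 + t4 + t5;
    }
}
```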

Understanding that register allocation and spilling exist helps interpret optimization reports and performance profiles. It is also relevant when reading compiler assembly output, if you ever examine it.

Registers and Vectorization

Vectorization uses wide registers to operate on multiple data elements in parallel. In this context, the size and number of vector registers become important performance factors.

For a vector register of width $W$ bits and an element size of $b$ bits, the number of elements processed per vector instruction is

$$
N = \frac{W}{b}
$$

For example, if $W = 256$ bits and $b = 64$ bits for double precision, then $N = 4$. A single vector instruction can then add four pairs of doubles, with each of its two operand registers holding four elements.
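
To make this concrete, the sketch below uses x86_64 AVX intrinsics (assuming a CPU and compiler with AVX support, e.g. compiled with -mavx, and taking n to be a multiple of 4 for brevity), so that each vector addition processes four doubles held in 256-bit registers:

```c
#include <immintrin.h>

/* Sketch, assuming AVX support: each 256-bit ymm register holds four
 * doubles, so one vector add handles four element pairs at once. */
void add4(double *c, const double *a, const double *b, long n) {
    for (long i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);  /* load 4 doubles into a vector register */
        __m256d vb = _mm256_loadu_pd(&b[i]);
        __m256d vc = _mm256_add_pd(va, vb);   /* one instruction, four additions */
        _mm256_storeu_pd(&c[i], vc);          /* store 4 results back to memory */
    }
}
```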

The compiler must map arrays and loop variables into vector registers while also maintaining scalar registers for loop control, address calculation, and other temporary values. This combined demand can increase register pressure. If pressure becomes too high, the compiler may choose not to vectorize certain loops or may generate less efficient code.

From an HPC programmer’s perspective, simple loop bodies with straightforward memory access patterns are more likely to vectorize cleanly, because they allow better use of vector registers with minimal spilling.

Register Usage and Latency Hiding

Modern CPUs can execute multiple instructions per cycle and often allow instructions to be issued out of their original program order, as long as dependencies are respected. This ability to reorder and overlap operations is an important mechanism for hiding latencies, such as the time it takes for the result of an instruction to become available.

Registers are at the center of this process. Instruction scheduling and out-of-order execution are driven by the dependencies between registers. If one instruction produces a value in a register and a later instruction consumes that register, there is a dependency; the CPU must respect this order, which can prevent some overlap. If there are enough independent instructions that use distinct registers, the CPU can schedule them in parallel to keep its pipelines busy.

As a result, code with many independent computations that use different registers can achieve higher throughput because the CPU can overlap them. Conversely, code that repeatedly reads and writes the same few registers with strict dependencies may see more stalls.
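
A classic illustration is a reduction split across several independent accumulators (a manual sketch; compilers can do this themselves when allowed to reassociate floating-point sums, e.g. under fast-math options). Each accumulator lives in its own register, so the additions no longer form a single dependency chain. For brevity, n is assumed to be a multiple of 4.

```c
/* Sketch: s0..s3 occupy different registers with no dependencies on
 * one another, so the CPU can overlap their additions instead of
 * serializing on a single accumulator. */
double sum4(const double *a, long n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (long i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```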

Compilers try to reorder instructions and allocate registers in a way that enables the CPU’s out-of-order engine to hide latencies. This is particularly important in the inner loops of HPC codes.

Registers and Function Calls

Function calls have an impact on register usage. At a call boundary, the program must follow a calling convention, which is a platform-specific set of rules about how arguments are passed, which registers are preserved across calls, and which registers may be overwritten by the callee.

Typically, some registers are designated as caller-saved and others as callee-saved. If a function wants to keep values in caller-saved registers across a call, it must save them to memory or to callee-saved registers before calling and then restore them afterwards. This saving and restoring also involves memory traffic and may increase register pressure.

For performance-critical HPC kernels, inlining small functions can sometimes reduce the need for such save and restore sequences, because the compiler can then manage registers across what used to be a call boundary. This can both reduce overhead and provide more freedom for optimization.
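
As a sketch (identifiers illustrative), marking a small helper static inline invites the compiler to inline it, so register allocation can span what used to be the call boundary; compilers also inline such functions on their own at higher optimization levels.

```c
/* Sketch: once scale() is inlined into the loop, no calling
 * convention applies at the call site, so values can stay in
 * registers across what used to be a function call. */
static inline double scale(double x, double f) {
    return x * f;
}

void scale_all(double *a, long n, double f) {
    for (long i = 0; i < n; i++) {
        a[i] = scale(a[i], f);  /* likely inlined: no save/restore sequence */
    }
}
```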

Registers in Multithreaded and Parallel Contexts

On shared-memory systems with threads, each hardware core has its own set of physical registers. Each thread running on a core uses those registers while it is scheduled on that core. During a context switch, when the operating system moves a thread off a core or replaces it with another thread, it saves the contents of the registers to memory and restores the other thread’s registers.

From a programmer’s standpoint, this means that registers are private to a thread at any moment and are not shared. Data that must be visible to other threads must reside in memory, not only in registers. Compilers and memory models ensure that stores to shared variables are eventually written out to memory from registers and made visible to other threads under the rules of the language and any synchronization used.
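
A well-known consequence, sketched below with C11 atomics (identifiers illustrative), is that a plain flag polled in a loop may legally be kept in a register and never re-read; declaring it atomic forces the compiler to reload it from memory, so a store by another thread becomes visible.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch: with a plain bool, the compiler could keep `done` in a
 * register and spin forever. With atomic_bool, each iteration
 * re-reads memory and eventually observes another thread's
 * atomic_store(&done, true). */
atomic_bool done = false;

void wait_for_done(void) {
    while (!atomic_load(&done)) {
        /* spin until another thread sets the flag */
    }
}
```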

In distributed-memory contexts with MPI, each process has its own address space and its own registers. There is never any sharing of registers between ranks. Registers serve only as local working storage inside each process.

Observing Register Effects Indirectly

In most HPC application development, you do not manipulate registers directly, but you can see their effects indirectly through several tools and reports.

Compiler reports can indicate when loops fail to vectorize or when excessive register pressure is present. Specific compiler flags request such reports; with GCC, for example, the -fopt-info-vec family of flags reports on vectorized and missed loops.

Assembly output, which the compiler can emit (with the -S flag on GCC and Clang, for example), shows explicit register usage. You can see which variables are held in which registers and where spills occur. While not always necessary, this can be educational.

Hardware performance counters, accessible through profiling tools, can reveal metrics related to load and store operations, cache misses, and other events. Many of these are influenced by how well the compiler mapped data to registers versus memory.

By understanding that registers are the first and fastest level of the memory hierarchy, and that their limited number can drive compiler decisions, you gain a more informed view of why certain code structures perform better than others on HPC systems.
