
10.2 GPU architecture basics

The Basic Shape of a GPU

GPUs are designed to run very many lightweight threads in parallel. While a CPU is optimized for low-latency execution of a small number of complex threads, a GPU is optimized for high-throughput execution of a huge number of simple, similar threads.

A modern GPU consists of multiple identical processing units, a specialized memory hierarchy, and hardware that schedules and hides latency by quickly switching between groups of threads. Vendors use different names for some of these components, but the overall structure is very similar across architectures.

The key ideas are massive parallelism, groups of threads that execute in lockstep, and high-bandwidth but sometimes high-latency memory that needs careful management to obtain good performance.

A GPU is a throughput-oriented processor that relies on many lightweight threads and lockstep execution of thread groups to hide memory latency and achieve high performance.

Streaming Multiprocessors and Cores

At the heart of a GPU are its streaming multiprocessors. Each such unit, called a streaming multiprocessor (SM) on NVIDIA hardware and a compute unit (CU) on AMD hardware, contains many simple arithmetic cores, small memories, and control logic. When you launch a GPU kernel, the GPU runtime maps your work onto these SMs.

Inside each SM there are:

Scalar or vector arithmetic units that perform integer and floating point operations.

Special function units for operations such as trigonometric functions.

Registers that hold per-thread data very close to the arithmetic units.

Small on-chip memories that support sharing between threads and buffering of data.

Different generations and vendors vary in the exact number of arithmetic units per SM and the supported instruction sets, but the programming model is similar: your total work is decomposed into many threads, these threads are grouped into sets (blocks, workgroups), and blocks are scheduled onto SMs as resources become available.

The number of blocks that can run concurrently on an SM is limited by hardware resources such as registers and shared memory. This limit is a central concept in GPU performance and is often referred to as occupancy, which you will study in more detail when discussing performance considerations.
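
As a concrete, CUDA-flavoured illustration, the sketch below queries the per-kernel resource usage that feeds into this limit. The kernel scale_kernel is a hypothetical example; the query itself uses the standard CUDA runtime call cudaFuncGetAttributes, and error checking is omitted for brevity.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel whose resource usage we want to inspect.
__global__ void scale_kernel(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    // Ask the runtime how many registers and how much static shared memory
    // the compiled kernel uses; these numbers bound how many blocks fit on an SM.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, scale_kernel);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    return 0;
}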

Thread Hierarchy and Execution Groups

To understand GPU architecture in practice, you must understand how threads are grouped, scheduled, and executed in lockstep.

Most GPU programming models expose a three-level hierarchy:

Individual threads are the smallest units of work, each with its own registers and local variables.

Threads are grouped into blocks (CUDA) or workgroups (OpenCL and others). Threads in the same block can synchronize with each other and can share a small on-chip memory region.

Blocks are organized into a grid that represents a single kernel launch.
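
To make this hierarchy concrete, here is a minimal CUDA-style sketch. The kernel name vector_add and the block size of 256 are arbitrary choices for illustration.

#include <cuda_runtime.h>

// Each thread handles one element; its position in the hierarchy is
// (blockIdx.x, threadIdx.x), and its global index combines both levels.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // thread's global index
    if (i < n) c[i] = a[i] + b[i];
}

// Launch: the grid is sized so that blocks of 256 threads cover all n elements, e.g.
// vector_add<<<(n + 255) / 256, 256>>>(a, b, c, n);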

On the hardware side, each block runs on a single SM. The SM then divides the threads into smaller fixed-size groups that execute in lockstep. These groups are called warps on NVIDIA hardware and wavefronts on AMD.

A warp has a fixed width: on NVIDIA GPUs a warp contains 32 threads, while AMD wavefronts contain 32 or 64 threads depending on the architecture. The hardware fetches and executes one instruction for all active threads in a warp at the same time. If all threads in the warp follow the same control path, execution is efficient. If some threads take one branch and some take another, the GPU must serialize parts of the work, as discussed below.

Threads are scheduled and executed in fixed-size groups (warps or wavefronts). All threads in a group execute the same instruction at the same time, which is often described as SIMT (Single Instruction, Multiple Threads).

This SIMT model is related to SIMD and vectorization concepts that you will see elsewhere in the course, but the GPU programming model presents independent threads instead of explicit vector registers. Internally, the GPU maps these threads onto vector hardware.
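
As a small illustration of this mapping, the CUDA sketch below computes, for each thread, which warp and which lane within that warp it belongs to. The kernel and output arrays are hypothetical; warpSize is the built-in CUDA device constant (32 on current NVIDIA GPUs).

// Assuming a 1D block: threads with consecutive threadIdx.x values belong to
// the same warp and execute each instruction together.
__global__ void warp_layout(int *warp_of, int *lane_of) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;   // position within the warp (0..31 on NVIDIA)
    int warp = threadIdx.x / warpSize;   // which warp within the block
    warp_of[tid] = warp;
    lane_of[tid] = lane;
}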

Control Flow and Branch Divergence

Because warps execute in lockstep, control flow has a direct impact on performance. Within a given warp, if every thread follows exactly the same instruction path, the hardware can keep all arithmetic units busy. If different threads within a warp need to follow different branches of an if or loop structure, the GPU handles this by masking and serialization.

Consider a warp of 32 threads with a branch:

if (condition) {
    // path A
} else {
    // path B
}

If some threads evaluate condition as true and others as false, the GPU must execute path A with only the threads that took that path active. Then it executes path B with the remaining threads active. The total number of instructions executed is roughly the sum of both paths, and part of the warp is idle at any given time.

This effect is known as branch or control flow divergence. It does not change the correctness of the program but can significantly reduce throughput.

Branch divergence is local to a warp. Different warps can execute different code paths without affecting each other. When writing GPU code, it is common to design algorithms so that threads in the same warp tend to follow similar control paths.
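
The hedged sketch below contrasts a branch whose condition varies from thread to thread with one whose condition is almost uniform within each warp. It assumes the data has been partitioned on the host so that the first n_pos elements take one path; this is just one of several ways to reduce divergence, and the kernel names are hypothetical.

// Divergent version: the branch depends on each thread's own data, so threads
// within one warp may take different paths and the warp executes both.
__global__ void divergent(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f)  y[i] = x[i] * 2.0f;   // path A
        else              y[i] = -x[i];         // path B
    }
}

// Warp-uniform version: the data is assumed to be partitioned on the host so
// that the first n_pos elements are the "positive" cases. The condition is now
// the same for all threads of a warp, except in the single warp at the boundary.
__global__ void uniform(const float *x, float *y, int n, int n_pos) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i < n_pos)  y[i] = x[i] * 2.0f;     // whole warps take path A...
        else            y[i] = -x[i];           // ...or path B, apart from the boundary warp
    }
}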

GPU Memory Hierarchy: An Overview

GPUs also rely on a memory hierarchy, but with different emphasis than CPUs. GPU designs prioritize bandwidth and parallel access patterns over very low latency for single operations. From the perspective of a kernel, you typically see several main memory regions.

Global memory is large, off-chip DRAM that is visible to all threads but has high latency. This is where most of your data arrays live.

Shared memory is a small on-chip memory region shared by all threads in the same block. It has much lower latency than global memory and much higher effective bandwidth, but its size is limited, often on the order of tens of kilobytes per SM.

Registers are private to individual threads and are the fastest storage. Each thread has its own set of registers for local variables. The total number of registers per SM is fixed, so using many registers per thread can reduce how many threads the SM can run concurrently.

In addition to these, GPUs typically expose constant memory and texture or read-only cache paths that are optimized for specific access patterns, such as broadcast of the same value to many threads or spatially coherent reads.

From slowest and largest to fastest and smallest, a typical GPU memory hierarchy runs from global memory through shared memory to registers, complemented by additional specialized read-only caches.
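
In CUDA-style source code, these memory spaces typically appear as follows. The kernel below is a hypothetical sketch and assumes a one-dimensional block of at most 256 threads.

#include <cuda_runtime.h>

__constant__ float coeff;            // constant memory: read-only, broadcast to all threads
                                     // (set from the host with cudaMemcpyToSymbol)

__global__ void memory_spaces(const float *in, float *out, int n) {
    __shared__ float tile[256];      // shared memory: visible to all threads in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;                  // register: private to this thread

    if (i < n) v = in[i];            // 'in' and 'out' point to global memory
    tile[threadIdx.x] = v;
    __syncthreads();                 // make the shared-memory writes visible block-wide

    if (i < n) out[i] = coeff * tile[threadIdx.x];
}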

Global Memory Access and Coalescing

The way threads access global memory has a crucial impact on performance. GPUs are designed to transfer data in wide, aligned bursts that serve multiple threads at once. When threads in a warp access addresses that lie close together, the hardware can combine these accesses into a small number of wide memory transactions. This is known as coalesced access.

For example, suppose 32 threads in a warp each load A[i], and the indices i are consecutive integers. If the data is laid out contiguously and aligned, the GPU may satisfy the entire warp’s request with only a few memory transactions. This yields high effective bandwidth.

If the threads in a warp access scattered or irregular addresses, the hardware may need to perform many separate transactions, which wastes bandwidth and increases latency. This is called uncoalesced access and is a common source of poor performance in GPU code.

Although the exact rules for coalescing depend on the specific architecture, the basic idea is to organize data and thread indices so that threads with neighboring IDs in a warp access neighboring memory locations.
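
The following CUDA sketch contrasts the two patterns. The kernels are hypothetical, and the exact number of transactions per warp depends on the architecture.

// Coalesced: thread i reads element i, so a warp touches 32 consecutive floats
// and the hardware can serve the whole warp with a few wide transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride (assuming 'in' has at least
// n * stride elements), so neighboring threads touch addresses far apart and
// the warp's loads split into many separate transactions.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride];
}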

Shared Memory and Data Sharing Within a Block

Shared memory occupies a special place in GPU architecture. It is implemented using on-chip SRAM and is directly connected to the SM’s arithmetic units. It is designed to support two important patterns: reuse of data by many threads in a block and fast data exchange between threads.

Threads in the same block can read and write shared memory, with well-defined synchronization primitives to ensure that writes by one thread are visible to others at the right time. By loading a chunk of global data into shared memory once and then reusing it multiple times within a block, you can reduce the number of high-latency global memory accesses.

Shared memory is physically organized into banks that can be accessed in parallel; successive words of shared memory are typically assigned to successive banks. If multiple threads in a warp access different addresses within the same bank at the same time, a bank conflict occurs, and the hardware serializes part of the access, similar in spirit to branch divergence but for memory. If each thread accesses a different bank, or if all threads read the same address (a broadcast), the accesses are conflict free.

The limited capacity of shared memory per SM, and its organization into banks, are architectural constraints that shape common GPU algorithms such as tiled matrix multiplication and stencil computations.
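
As one hedged example of this pattern, the sketch below uses shared memory for a tiled matrix transpose, with one element of padding per row so that the column-wise reads avoid bank conflicts. The tile size, names, and launch configuration are illustrative choices.

#define TILE 32

// Each block loads a 32x32 tile of the input into shared memory with coalesced
// reads, synchronizes, then writes the transposed tile back with coalesced
// writes. The +1 padding shifts each row onto a different bank.
__global__ void transpose_tiled(const float *in, float *out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];   // padded to avoid bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[(size_t)y * width + x];

    __syncthreads();                         // wait until the tile is fully loaded

    // Write transposed: block (bx, by) of the input becomes block (by, bx) of the output.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < height && ty < width)
        out[(size_t)ty * height + tx] = tile[threadIdx.x][threadIdx.y];
}

// Launch with dim3 block(TILE, TILE) and a grid that covers the matrix, e.g.
// transpose_tiled<<<dim3((width+TILE-1)/TILE, (height+TILE-1)/TILE), dim3(TILE, TILE)>>>(in, out, width, height);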

Registers and Local Memory

Registers are the fastest storage available to a thread on a GPU. They hold scalar or small vector values and are used for intermediate computations and local variables. Each SM has a fixed register file, and the number of registers allocated to each thread is determined at compile time from your kernel's code.

If each thread uses many registers, fewer threads can reside on the SM at the same time. This reduces occupancy and may make it harder for the hardware to hide memory latency. On the other hand, capping register usage too aggressively, for example to raise occupancy, may force the compiler to spill values to a memory region sometimes called local memory, which is actually a portion of global memory private to the thread and much slower than registers.

There is an architectural balance between using enough registers to avoid spills and leaving enough registers unused so that many threads can be scheduled concurrently. This balance depends on the compute to memory ratio of the kernel and is an important aspect of performance tuning.
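
In CUDA this trade-off can be influenced explicitly. The sketch below shows the __launch_bounds__ qualifier and the related nvcc options; the kernel body is a placeholder and the numbers are illustrative.

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) asks the compiler to
// keep register usage low enough that at least minBlocksPerSM blocks of up to
// maxThreadsPerBlock threads can be resident on one SM. If the kernel needs
// more registers than that budget allows, the compiler spills to local memory.
__global__ void __launch_bounds__(256, 4) heavy_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 16; ++k)          // placeholder work that consumes registers
            acc += in[i] * (float)k;
        out[i] = acc;
    }
}

// The same cap can be applied globally at compile time, e.g.
//   nvcc -maxrregcount=32 kernel.cu
// Compiling with -Xptxas -v reports registers per thread and any spill bytes.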

Caches and Specialized Read-Only Paths

Modern GPUs also include cache hierarchies to reduce the effective latency of global memory. Typically there is a per-SM L1 cache or combined shared-memory and L1 subsystem, and a device-wide L2 cache. The details vary across architectures and vendors, but the general role is similar to CPU caches: to exploit temporal and spatial locality in memory access patterns.

In addition, GPUs expose specialized read-only memory paths such as constant memory and texture or read-only caches. Constant memory is optimized for broadcast to many threads, for example, when many threads read the same coefficient. Texture or read-only caches can provide benefits for irregular or spatially coherent access patterns, such as neighbor lookups in grids or meshes.

Although these are hardware features, they are often visible in the programming model through distinct memory qualifiers or APIs. Understanding their architectural purpose helps you choose the right memory space for different kinds of data.
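
A hedged CUDA sketch of how these paths are commonly reached: __constant__ memory with cudaMemcpyToSymbol for broadcast coefficients, and a const __restrict__ pointer as a hint that loads may go through the read-only cache. The filter size and kernel are illustrative.

#include <cuda_runtime.h>

__constant__ float filter[9];   // constant memory: all threads read the same coefficients

// Marking the input as const and __restrict__ allows the compiler to route its
// loads through the read-only (texture) cache on architectures that have one.
__global__ void apply_filter(const float * __restrict__ in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 9; ++k)
            acc += filter[k] * in[min(max(i + k - 4, 0), n - 1)];  // clamped neighbor reads
        out[i] = acc;
    }
}

// Host side: copy the coefficients into constant memory before launching, e.g.
// float h_filter[9] = { ... };
// cudaMemcpyToSymbol(filter, h_filter, sizeof(h_filter));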

Latency Hiding and Occupancy

A central architectural strategy in GPUs is latency hiding. Instead of making each individual memory operation very fast, GPUs switch between many active warps when one warp is stalled waiting for data. This is often described as hardware multithreading.

Each SM maintains a pool of ready warps. When a warp issues a memory access with high latency, the scheduler can switch to another warp that is ready to execute arithmetic instructions. From the programmer’s perspective, this means that a single long-latency access can be overlapped with useful work from other warps, as long as there are enough active warps to choose from.

The number of warps that can reside on an SM is limited by architectural resources such as registers, shared memory, and the maximum number of threads or blocks per SM. The fraction of the theoretical maximum number of warps that are actually active for a given kernel is called occupancy.

GPUs hide memory latency by switching between many active warps. Sufficient occupancy is needed so that some warps are always ready to run while others wait for memory.

High occupancy is often helpful but not always sufficient for good performance. If your kernel is heavily limited by memory bandwidth or instruction throughput, other factors may dominate. However, the architectural dependence of latency hiding on active warps is a core concept.
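
As an illustration, the CUDA runtime exposes an occupancy calculator. The sketch below estimates resident warps per SM for a hypothetical saxpy kernel at a block size of 256; error checking is omitted for brevity.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many blocks of 256 threads can be resident per SM for this kernel,
    // given its register and shared-memory usage?
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, saxpy, 256, 0);

    int active_warps = blocks_per_sm * 256 / prop.warpSize;
    int max_warps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %d of %d warps per SM (%.0f%%)\n",
           active_warps, max_warps, 100.0 * active_warps / max_warps);
    return 0;
}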

Interconnects and Device Integration

GPUs do not usually replace CPUs in an HPC node. Instead, they are attached as accelerators. The way the GPU is connected to the host system is an important architectural feature that affects data transfer performance.

Discrete GPUs are typically connected over PCI Express. This bus provides relatively high bandwidth compared to general-purpose I/O but is still much slower and has higher latency than on-device memory bandwidth. Some systems include additional high-bandwidth links between GPUs or between CPU and GPU, such as NVLink or proprietary interconnects, to reduce communication bottlenecks for multi-GPU or tightly coupled workloads.

There are also integrated or unified memory architectures where the CPU and GPU share the same physical memory, or at least a unified address space. In such systems, some data movement is handled by the hardware and runtime, but the underlying bandwidth and latency characteristics still follow the general GPU pattern: high throughput for well-structured access, with a strong dependence on locality.

From an architectural point of view, this host device separation means that data transfer between CPU and GPU can become a significant component of the end-to-end runtime of an HPC application, even if the kernel itself runs extremely fast on the GPU.
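
The sketch below shows the two common patterns in CUDA terms: explicit allocation and transfer with cudaMalloc and cudaMemcpy on a discrete GPU, and unified memory with cudaMallocManaged. The kernel and sizes are illustrative, and error checking is omitted.

#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void run_explicit(float *h_x, int n) {
    float *d_x = nullptr;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // crosses PCIe (or NVLink)
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);    // crosses the link again
    cudaFree(d_x);
}

void run_managed(int n) {
    float *x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));   // unified address space; pages migrate on demand
    for (int i = 0; i < n; ++i) x[i] = 1.0f;    // touched on the host first
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
    cudaDeviceSynchronize();                     // wait before the host reads the results
    cudaFree(x);
}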

Precision, Specialized Units, and Mixed-Use Architectures

Modern HPC-oriented GPUs include specialized arithmetic units in addition to the standard FP32 and FP64 pipelines. Examples include tensor cores or matrix cores that perform small matrix multiply accumulate operations at very high throughput, often in reduced precision formats such as FP16, bfloat16, or other low-precision encodings.

These units are architecturally distinct from the general-purpose ALUs and can deliver much higher peak performance when used appropriately, especially in linear algebra and machine learning workloads. Their presence affects how GPUs are used in HPC for dense linear algebra, mixed-precision solvers, and AI-based components integrated into simulation pipelines.
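
For orientation, the heavily hedged sketch below shows the warp-level WMMA intrinsics through which CUDA exposes tensor cores (compute capability 7.0 or newer). In HPC practice these units are usually reached through libraries such as cuBLAS rather than hand-written kernels, and the 16x16x16 tile shape used here is only one of the supported configurations.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile of A by a 16x16 FP16 tile of B and
// accumulates into a 16x16 FP32 tile of C on the tensor cores.
__global__ void wmma_tile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);               // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);       // one tensor-core matrix multiply-accumulate
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Launch with a single warp, e.g. wmma_tile<<<1, 32>>>(dA, dB, dC),
// compiled with -arch=sm_70 or newer.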

The balance of FP32, FP64, and specialized units varies across product lines. Consumer GPUs may favor graphics and AI workloads with less emphasis on double precision, while HPC GPUs provide stronger FP64 performance. This variation will matter when you study numerical libraries and performance optimization; at the architectural level it is enough to note that a GPU is not a single homogeneous set of cores but includes multiple types of execution units optimized for different precision levels and operation types.

Summary

GPU architecture is built around the idea of running large numbers of lightweight threads in parallel, grouped into warps that execute in lockstep on streaming multiprocessors. The memory hierarchy emphasizes high-bandwidth global memory, fast on-chip shared memory, and many registers, complemented by caches and specialized read-only paths.

Latency is hidden by switching between many active warps, so architectural resources such as registers and shared memory indirectly control performance through occupancy. The connection between GPU and host, often across PCI Express or higher bandwidth links, introduces a separation between computation and data movement that must be managed at the application level.

Understanding these architectural features prepares you to use GPU programming models effectively and to interpret performance behavior of GPU-accelerated HPC applications.
