Overview of GPU Architecture
In HPC, GPUs are used as massively parallel numeric accelerators. Architecturally, they are very different from CPUs: where a CPU is optimized for low-latency execution of a few threads, a GPU is optimized for high-throughput execution of thousands of threads.
At a high level, a GPU consists of:
- Many streaming multiprocessors (SMs) or compute units
- A large pool of lightweight hardware threads
- A deep memory hierarchy optimized for throughput
- Specialized hardware for fast context switching and scheduling
This chapter focuses on the structural elements common to most modern discrete GPUs (NVIDIA, AMD, Intel) as they relate to HPC use.
Core Components: SMs and Cores
Streaming Multiprocessors / Compute Units
The basic building block of a GPU is the streaming multiprocessor (SM in NVIDIA terminology, compute unit or CU for AMD, Xe-core for Intel GPUs).
Each SM:
- Contains multiple arithmetic logic units (ALUs), sometimes called CUDA cores, stream processors, or shader cores
- Executes many threads in groups (warps, wavefronts, or subgroups)
- Has its own small, fast memories (register file, shared memory / local data store)
- Has hardware to schedule and switch between thread groups
Key properties for HPC:
- Many SMs per GPU (dozens to well over a hundred on modern devices)
- Many cores per SM (tens to hundreds)
- Massive total core count, but each core is relatively simple compared to a CPU core
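These structural parameters are easy to inspect at run time. The sketch below is a minimal CUDA runtime query for device 0 (NVIDIA-specific; AMD's HIP and Intel's SYCL stacks provide analogous device queries).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: report the per-device structural properties discussed above.
int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device:                %s\n", prop.name);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Warp size:             %d threads\n", prop.warpSize);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:      %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```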
GPU “Cores” vs CPU Cores
GPU cores are:
- Simpler: limited control flow, optimized for predictable, data-parallel work
- Narrow: usually focused on FP32/FP64 arithmetic plus some integer and special functions
- Dependent on the SM: they do not fetch their own instructions independently; they execute in lockstep groups
CPU cores are:
- Complex: out-of-order execution, large caches, branch prediction, etc.
- Designed for low latency and single-thread performance
- Fewer in number
This difference underpins why GPUs excel at data-parallel workloads and struggle with heavily branched, irregular code.
SIMT Execution: Warps and Wavefronts
GPUs use a programming and execution model often described as SIMT (Single Instruction, Multiple Threads).
- Threads are grouped into fixed-size units:
- NVIDIA: warps (32 threads on all current architectures)
- AMD: wavefronts (usually 32 or 64 threads)
- Intel: subgroups (size depends on architecture)
- All threads in a warp execute the same instruction at the same time on different data elements.
The scheduler on each SM:
- Selects a warp that is ready to execute
- Issues a single instruction that is applied across all active threads in that warp
- Switches to another warp when the current one stalls (e.g., on memory)
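A minimal CUDA sketch makes the grouping concrete: each block of 128 threads below contains four warps, and the hypothetical kernel whoAmI reports where each warp begins. The warp size of 32 is NVIDIA-specific.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustration only: each warp reports where it starts within the grid.
// All threads sharing a warp index are issued the same instruction together.
__global__ void whoAmI() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int lane = threadIdx.x % warpSize;                 // position within the warp
    int warp = threadIdx.x / warpSize;                 // warp index within the block
    if (lane == 0)  // one line of output per warp keeps things readable
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warp, tid);
}

int main() {
    whoAmI<<<2, 128>>>();      // 2 blocks x 128 threads = 4 warps per block
    cudaDeviceSynchronize();   // wait for device-side printf to complete
    return 0;
}
```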
Branching and Divergence
When threads in a warp follow different control paths (e.g., different if branches), the warp must:
- Execute one branch path with a subset of active threads while others are masked off
- Then execute the other path with the complementary subset
This is called control-flow divergence and reduces effective parallelism. In HPC kernels, you typically try to:
- Minimize branching inside performance-critical loops
- Structure data and conditions so most threads in a warp follow the same path
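The effect is easiest to see in code. In the first hypothetical CUDA kernel below, even and odd lanes of every warp take different branches, so each warp serializes both paths; in the second, the branch depends only on the block index, so all threads of a warp agree and no masking is needed.

```cuda
// Divergent: even and odd lanes follow different paths, so the warp executes
// both branches one after the other with half of its lanes masked each time.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) x[i] = 2.0f * x[i];
        else            x[i] = x[i] + 1.0f;
    }
}

// Warp-uniform: the condition depends only on the block index, and a warp
// never spans two blocks, so every thread of a warp takes the same path.
__global__ void uniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (blockIdx.x % 2 == 0) x[i] = 2.0f * x[i];
        else                     x[i] = x[i] + 1.0f;
    }
}
```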
Thread Hierarchy and Organization
Although each programming model has its own terminology, the hardware idea is similar:
- Individual thread: smallest unit of execution
- Warp/wavefront/subgroup: hardware scheduling unit (fixed size)
- Thread block / work-group: a collection of warps that execute on the same SM
- Grid / ND-range: collection of all blocks that form a kernel launch
From the architecture perspective:
- A thread block is assigned to a single SM
- An SM can host multiple blocks concurrently (occupancy), depending on resource usage (registers, shared memory, etc.)
- Threads in the same block can:
- Synchronize with each other
- Share data via on-chip shared memory
Inter-block communication is not handled directly by the hardware; it typically requires global memory or separate kernel launches.
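The hypothetical CUDA kernel below illustrates exactly these block-level capabilities: threads of one block stage data in shared memory, meet at a __syncthreads() barrier, and then read one another's values. No equivalent exchange exists across blocks within a single kernel launch.

```cuda
#include <cuda_runtime.h>

// Sketch: each block loads its slice of the input into shared memory,
// synchronizes, then every thread reads its left neighbour's value.
__global__ void neighborEcho(const float* in, float* out, int n) {
    extern __shared__ float tile[];            // shared by threads of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                           // block-wide barrier
    if (i < n) {
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        out[i] = tile[left];                   // data exchanged via shared memory
    }
}

// Host-side launch: grid dimensions derived from the problem size, plus the
// dynamic shared-memory allocation for one tile per block.
void launchNeighborEcho(const float* d_in, float* d_out, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    size_t smem = threads * sizeof(float);
    neighborEcho<<<blocks, threads, smem>>>(d_in, d_out, n);
}
```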
GPU Memory Hierarchy
Like CPUs, GPUs have a hierarchy of memories with different sizes, latencies, and bandwidths. For HPC, understanding their roles is essential for performance.
On-Device Global Memory (VRAM / HBM)
- Large capacity (tens of gigabytes on current HPC-class devices)
- High bandwidth (hundreds of GB/s to multiple TB/s on high-end devices)
- Relatively high latency compared to on-chip memories
- Accessible by all threads on the GPU
- Often implemented as GDDR or HBM (High Bandwidth Memory)
Global memory is where:
- Your main data arrays typically reside during GPU computation
- Kernels read and write most of their inputs and outputs
Caches
Modern GPUs have multiple cache levels:
- L2 cache:
- Shared across the entire GPU
- Serves as a large on-chip cache for global memory accesses
- L1 / texture / data cache:
- Located closer to each SM
- Smaller but faster
- Efficiency depends strongly on access patterns (spatial and temporal locality)
Not all memory types are cached the same way, and cache behavior can be architecture- and access-pattern-dependent. In HPC, coalesced and regular memory accesses are used to maximize cache and memory-system efficiency.
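The contrast between friendly and unfriendly patterns can be sketched with two hypothetical CUDA kernels: in the first, consecutive threads of a warp read consecutive elements (coalesced), so a warp's load touches only a few cache lines; in the second, a stride spreads one warp's accesses across many lines.

```cuda
// Coalesced: thread k of a warp reads element k, so the warp's 32 loads fall
// into a small number of contiguous cache lines.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read elements `stride` apart, so one warp's
// loads can require up to 32 separate cache lines.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```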
Shared Memory / Local Data Store
Each SM has a small, explicitly managed on-chip memory that:
- Has much lower latency than global memory
- Is shared among all threads in a block
- Is programmer-controlled (you decide what to store there)
- Is often banked, with possible bank conflicts if accessed poorly
Typical HPC uses:
- Staging tiles of matrices or subarrays to enable data reuse
- Implementing small, fast scratchpads for reductions, transposes, and stencils
- Minimizing repeated global memory reads for the same data
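A classic instance of this tiling pattern is a matrix transpose staged through shared memory. The sketch below follows the widely used tile-plus-padding scheme, assuming 32x32 tiles and 32x32 thread blocks; the padding column avoids bank conflicts when the tile is read back column-wise.

```cuda
#define TILE 32   // tile edge; launch with 32x32 thread blocks

// Sketch: stage a tile in shared memory so that both the global-memory read
// and the global-memory write are coalesced. `in` has height rows, width columns.
__global__ void transposeTiled(const float* in, float* out,
                               int width, int height) {
    __shared__ float tile[TILE][TILE + 1];    // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();   // the whole tile must be loaded before it is reused

    // Write the transposed tile; note the swapped block indices.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```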
Registers
- Fastest storage level per thread
- Allocated from a large register file on each SM
- Not visible directly to other threads
- Register usage per thread affects:
- How many threads/blocks can be resident on an SM
- Overall occupancy and ability to hide latency
If a kernel uses too many registers, the compiler may:
- Spill values to local memory (which resides in global memory and is slow)
- Limit the number of concurrent warps per SM
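On NVIDIA hardware this can be inspected and influenced directly: compiling with nvcc -Xptxas -v reports registers per thread, --maxrregcount caps them for a whole compilation unit, and __launch_bounds__ gives the compiler a per-kernel hint, as in the sketch below (hypothetical kernel, illustrative bounds).

```cuda
// Sketch: __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) tells the
// compiler this kernel is launched with at most 256 threads per block and
// that at least 2 blocks should fit on an SM, so it limits register usage
// per thread accordingly (possibly spilling if necessary).
__global__ void __launch_bounds__(256, 2)
boundedKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i] + 1.0f;
}
```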
Constant and Texture Memories (Vendor-Specific Names)
Most GPUs provide specialized read-only memories:
- Constant memory:
- Cached and optimized for broadcast (many threads reading the same value)
- Good for small parameter tables or constants used by all threads
- Texture / read-only data caches:
- Optimized for particular access patterns (e.g., 2D spatial locality)
- Might offer hardware interpolation and other features (more common in graphics, sometimes useful in HPC)
Using these appropriately can reduce pressure on global memory.
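A typical use of constant memory is a small coefficient table read by every thread, as in the sketch below (hypothetical coeffs table and polynomial kernel, CUDA syntax); because all threads of a warp read the same entry, the constant cache broadcasts it.

```cuda
#include <cuda_runtime.h>

// Small read-only table placed in constant memory; broadcast-friendly because
// every thread reads the same entries.
__constant__ float coeffs[4];

// Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x^2 + coeffs[3]*x^3 per element.
__global__ void applyPolynomial(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = coeffs[0] + x * (coeffs[1] + x * (coeffs[2] + x * coeffs[3]));
    }
}

// Host side: fill the constant-memory table before launching the kernel.
void setCoefficients(const float host_coeffs[4]) {
    cudaMemcpyToSymbol(coeffs, host_coeffs, 4 * sizeof(float));
}
```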
Host–Device Relationship and Interconnect
Discrete GPUs are typically attached to a CPU host via an interconnect such as:
- PCIe (various generations)
- NVLink, Infinity Fabric, or other high-bandwidth proprietary links
Architecturally:
- The CPU (host) and GPU (device) have separate memory spaces
- Data must be moved between host memory and device global memory
- The interconnect has much lower bandwidth and much higher latency than on-device memory
For HPC:
- Host–device transfers are expensive relative to on-device computation
- Algorithms are often designed to:
- Transfer data in large chunks
- Perform as much computation as possible per transfer
- Avoid unnecessary round trips between CPU and GPU
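Put together, a common host-side pattern looks like the sketch below (CUDA runtime API, hypothetical compute kernel): one large transfer in, many kernel launches on the resident data, one transfer back.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Placeholder for the real on-device work.
__global__ void compute(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 0.5f * data[i] + 1.0f;
}

// Sketch of the pattern above: copy once, compute many times, copy back once.
void processOnGpu(std::vector<float>& host_data) {
    int n        = static_cast<int>(host_data.size());
    size_t bytes = n * sizeof(float);

    float* d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, host_data.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    for (int step = 0; step < 100; ++step)          // many kernels per transfer
        compute<<<blocks, threads>>>(d_data, n);

    cudaMemcpy(host_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```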
Some modern systems support features like:
- Unified virtual addressing (UVA)
- On-demand page migration
- GPU-direct communication with network/storage
But physically, bandwidth and latency constraints still matter for performance.
Latency Hiding and Occupancy
GPUs tolerate high latency (especially memory latency) by running many threads concurrently:
- When one warp stalls (e.g., waiting for memory), the SM’s scheduler:
- Quickly switches to another ready warp
- Overlaps memory latency with computation from other warps
- No expensive OS-level context switches are needed; state is kept in hardware
Occupancy:
- The ratio of active warps per SM to the maximum possible
- Affected by:
- Registers per thread
- Shared memory per block
- Threads per block
- Higher occupancy generally improves the GPU’s ability to hide latency, but:
- It is not the only performance factor
- Very high occupancy with poor memory access patterns can still perform badly
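For CUDA kernels, the runtime can estimate these limits directly. The sketch below asks how many blocks of a hypothetical saxpy kernel fit on one SM for a given block size and converts the answer into an occupancy figure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // device 0 assumed

    int threadsPerBlock = 256;
    int blocksPerSM     = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, saxpy, threadsPerBlock, /*dynamicSmemBytes=*/0);

    int activeWarps = blocksPerSM * threadsPerBlock / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Resident blocks per SM: %d\n", blocksPerSM);
    printf("Occupancy estimate:     %.0f%%\n",
           100.0 * activeWarps / maxWarps);
    return 0;
}
```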
Architecturally, GPU design assumes that:
- There are many independent threads
- Threads spend much of their time waiting on memory operations
- The scheduler can always find work to do while others wait
Specialized Functional Units
Besides general-purpose ALUs, GPUs often include specialized hardware units:
- Tensor / matrix cores:
- Accelerate small matrix-multiply–accumulate operations
- Often support mixed precision (e.g., FP16, BF16, TF32, INT8)
- Critical for AI workloads and increasingly leveraged in HPC (e.g., mixed-precision linear algebra)
- Special function units:
- Implement transcendental functions (sin, cos, exp, log, etc.)
- Trade some accuracy for throughput (e.g., fast approximate intrinsics versus slower, fully accurate library routines)
- Atomic units:
- Support atomic operations on integers and floating-point data in memory
At the architectural level, these units:
- Operate alongside standard ALUs within each SM
- Are fed via specific instruction types generated by compilers or intrinsics
- Have their own throughput/latency characteristics
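As a small example of the atomic units in action, the hypothetical CUDA kernel below accumulates a global sum with atomicAdd; a tuned reduction would first combine values within each block (in shared memory or with warp shuffles) and issue far fewer atomics.

```cuda
// Sketch: every thread contributes its element to a single accumulator.
// The hardware atomic unit serializes conflicting updates correctly.
__global__ void sumAll(const float* in, float* result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(result, in[i]);   // FP32 atomic add in global memory
}
```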
Multi-GPU and Node-Level GPU Topologies
Within a single node, there may be multiple GPUs connected via:
- PCIe fabric (often via switches)
- Dedicated GPU–GPU links (NVLink, Infinity Fabric, etc.)
Architecturally, this affects:
- Peer-to-peer (P2P) capabilities between GPUs
- Bandwidth for exchanging data between GPUs versus via the CPU
- How multi-GPU algorithms are structured (e.g., direct GPU–GPU communication vs. staging via host)
The physical layout (topology) matters for:
- How workloads are mapped to GPUs
- How data is partitioned across devices
- Achievable scaling efficiency inside a node
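On NVIDIA systems the topology is visible to software through peer-access queries. The sketch below (devices 0 and 1 assumed to exist) checks whether GPU 0 can address GPU 1's memory directly and enables the direct path if so.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
    printf("GPU 0 -> GPU 1 peer access possible: %s\n", canAccess ? "yes" : "no");

    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0
        // Device 0 can now dereference allocations made on device 1, and
        // cudaMemcpyPeer / cudaMemcpyPeerAsync can use the direct GPU-GPU link.
    }
    return 0;
}
```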
Precision and Throughput Trade-offs
GPU hardware is designed with different arithmetic pipelines:
- FP64 (double precision): essential for many traditional HPC applications
- FP32 (single precision): higher throughput, widely used
- Lower precisions (FP16, BF16, TF32, INT8): extremely high throughput, especially on tensor cores
Architecturally:
- FP64 units may be fewer or slower than FP32 units
- Tensor cores often operate on lower-precision data but accumulate in higher precision
This leads to common patterns in HPC:
- Mixed-precision algorithms where most computation uses fast low-precision units
- Occasional high-precision refinement steps using FP64
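A minimal illustration of the idea is a dot product whose elementwise products run in FP32 while accumulation is carried in FP64; the hypothetical CUDA kernel below writes one high-precision partial sum per thread, to be reduced afterwards.

```cuda
// Sketch: bulk arithmetic in FP32, accuracy-critical accumulation in FP64.
// `partial` must hold one entry per launched thread; the partial sums are
// reduced on the host or in a follow-up kernel.
__global__ void dotMixed(const float* x, const float* y,
                         double* partial, int n) {
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    double acc = 0.0;                                   // FP64 accumulator
    for (int k = i; k < n; k += stride)
        acc += static_cast<double>(x[k] * y[k]);        // FP32 multiply, FP64 add

    partial[i] = acc;
}
```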
Understanding the GPU’s precision mix is important when selecting hardware for specific scientific workloads.
Summary
From an architectural perspective, GPUs:
- Consist of many SMs packed with simple cores
- Execute threads in lockstep groups (warps/wavefronts) using SIMT
- Rely on a deep memory hierarchy with fast on-chip memories and very high-bandwidth global memory
- Hide latency by running large numbers of concurrent threads
- Communicate with the CPU and other GPUs via relatively slower interconnects
These features make GPUs highly effective for data-parallel, compute-intensive workloads that can exploit massive concurrency and structured memory access patterns, which is central to their role in HPC.