Overview of GPU Architecture
In HPC, GPUs are used as massively parallel numeric accelerators. Architecturally, they are very different from CPUs: where a CPU is optimized for low-latency execution of a few threads, a GPU is optimized for high-throughput execution of thousands of threads.
At a high level, a GPU consists of:
- Many streaming multiprocessors (SMs) or compute units
- A large pool of lightweight hardware threads
- A deep memory hierarchy optimized for throughput
- Specialized hardware for fast context switching and scheduling
This chapter focuses on the structural elements common to most modern discrete GPUs (NVIDIA, AMD, Intel) as they relate to HPC use.
Core Components: SMs and Cores
Streaming Multiprocessors / Compute Units
The basic building block of a GPU is the streaming multiprocessor (SM in NVIDIA terminology, compute unit or CU for AMD, Xe-core for Intel GPUs).
Each SM:
- Contains multiple arithmetic logic units (ALUs), sometimes called CUDA cores, stream processors, or shader cores
- Executes many threads in groups (warps, wavefronts, or subgroups)
- Has its own small, fast memories (register file, shared memory / local data store)
- Has hardware to schedule and switch between thread groups
Key properties for HPC:
- Many SMs per GPU (dozens to well over a hundred on modern devices)
- Many cores per SM (tens to hundreds)
- Massive total core count, but each core is relatively simple compared to a CPU core
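These structural parameters are easy to inspect at run time. The sketch below is a minimal CUDA runtime query for device 0 (NVIDIA-specific; AMD's HIP and Intel's SYCL stacks provide analogous device queries).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: report the per-device structural properties discussed above.
int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device:                %s\n", prop.name);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Warp size:             %d threads\n", prop.warpSize);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:      %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```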
GPU “Cores” vs CPU Cores
GPU cores are:
- Simpler: limited control flow, optimized for predictable, data-parallel work
- Narrow: usually focused on FP32/FP64 arithmetic plus some integer and special functions
- Dependent on the SM: they do not fetch their own instructions independently; they execute in lockstep groups
CPU cores are:
- Complex: out-of-order execution, large caches, branch prediction, etc.
- Designed for low latency and single-thread performance
- Fewer in number
This difference underpins why GPUs excel at data-parallel workloads and struggle with heavily branched, irregular code.
SIMT Execution: Warps and Wavefronts
GPUs use a programming and execution model often described as SIMT (Single Instruction, Multiple Threads).
- Threads are grouped into fixed-size units:
- NVIDIA: warps (32 threads on all current architectures)
- AMD: wavefronts (usually 32 or 64 threads)
- Intel: subgroups (size depends on architecture)
- All threads in a warp execute the same instruction at the same time on different data elements.
The scheduler on each SM:
- Selects a warp that is ready to execute
- Issues a single instruction that is applied across all active threads in that warp
- Switches to another warp when the current one stalls (e.g., on memory)
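A minimal CUDA sketch makes the grouping concrete: each block of 128 threads below contains four warps, and the hypothetical kernel whoAmI reports where each warp begins. The warp size of 32 is NVIDIA-specific.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustration only: each warp reports where it starts within the grid.
// All threads sharing a warp index are issued the same instruction together.
__global__ void whoAmI() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int lane = threadIdx.x % warpSize;                 // position within the warp
    int warp = threadIdx.x / warpSize;                 // warp index within the block
    if (lane == 0)  // one line of output per warp keeps things readable
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warp, tid);
}

int main() {
    whoAmI<<<2, 128>>>();      // 2 blocks x 128 threads = 4 warps per block
    cudaDeviceSynchronize();   // wait for device-side printf to complete
    return 0;
}
```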
Branching and Divergence
When threads in a warp follow different control paths (e.g., different if branches), the warp must:
- Execute one branch path with a subset of active threads while others are masked off
- Then execute the other path with the complementary subset
This is called control-flow divergence and reduces effective parallelism. In HPC kernels, you typically try to:
- Minimize branching inside performance-critical loops
- Structure data and conditions so most threads in a warp follow the same path
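The effect is easiest to see in code. In the first hypothetical CUDA kernel below, even and odd lanes of every warp take different branches, so each warp serializes both paths; in the second, the branch depends only on the block index, so all threads of a warp agree and no masking is needed.

```cuda
// Divergent: even and odd lanes follow different paths, so the warp executes
// both branches one after the other with half of its lanes masked each time.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) x[i] = 2.0f * x[i];
        else            x[i] = x[i] + 1.0f;
    }
}

// Warp-uniform: the condition depends only on the block index, and a warp
// never spans two blocks, so every thread of a warp takes the same path.
__global__ void uniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (blockIdx.x % 2 == 0) x[i] = 2.0f * x[i];
        else                     x[i] = x[i] + 1.0f;
    }
}
```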
Thread Hierarchy and Organization
Although each programming model has its own terminology, the hardware idea is similar:
- Individual thread: smallest unit of execution
- Warp/wavefront/subgroup: hardware scheduling unit (fixed size)
- Thread block / work-group: a collection of warps that execute on the same SM
- Grid / ND-range: collection of all blocks that form a kernel launch
From the architecture perspective:
- A thread block is assigned to a single SM
- An SM can host multiple blocks concurrently (occupancy), depending on resource usage (registers, shared memory, etc.)
- Threads in the same block can:
- Synchronize with each other
- Share data via on-chip shared memory
Inter-block communication is not handled directly by the hardware; it typically requires global memory or separate kernel launches.
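The hypothetical CUDA kernel below illustrates exactly these block-level capabilities: threads of one block stage data in shared memory, meet at a __syncthreads() barrier, and then read one another's values. No equivalent exchange exists across blocks within a single kernel launch.

```cuda
#include <cuda_runtime.h>

// Sketch: each block loads its slice of the input into shared memory,
// synchronizes, then every thread reads its left neighbour's value.
__global__ void neighborEcho(const float* in, float* out, int n) {
    extern __shared__ float tile[];            // shared by threads of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                           // block-wide barrier
    if (i < n) {
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        out[i] = tile[left];                   // data exchanged via shared memory
    }
}

// Host-side launch: grid dimensions derived from the problem size, plus the
// dynamic shared-memory allocation for one tile per block.
void launchNeighborEcho(const float* d_in, float* d_out, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    size_t smem = threads * sizeof(float);
    neighborEcho<<<blocks, threads, smem>>>(d_in, d_out, n);
}
```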
GPU Memory Hierarchy
Like CPUs, GPUs have a hierarchy of memories with different sizes, latencies, and bandwidths. For HPC, understanding their roles is essential for performance.
On-Device Global Memory (VRAM / HBM)
- Large capacity (tens of gigabytes on current HPC-class devices)
- High bandwidth (hundreds of GB/s to multiple TB/s on high-end devices)
- Relatively high latency compared to on-chip memories
- Accessible by all threads on the GPU
- Often implemented as GDDR or HBM (High Bandwidth Memory)
Global memory is where:
- Your main data arrays typically reside during GPU computation
- Kernels read and write most of their inputs and outputs
Caches
Modern GPUs have multiple cache levels:
- L2 cache:
- Shared across the entire GPU
- Serves as a large on-chip cache for global memory accesses
- L1 / texture / data cache:
- Located closer to each SM
- Smaller but faster
- Efficiency depends strongly on access patterns (spatial and temporal locality)
Not all memory types are cached the same way, and cache behavior can be architecture- and access-pattern-dependent. In HPC, coalesced and regular memory accesses are used to maximize cache and memory-system efficiency.
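The contrast between friendly and unfriendly patterns can be sketched with two hypothetical CUDA kernels: in the first, consecutive threads of a warp read consecutive elements (coalesced), so a warp's load touches only a few cache lines; in the second, a stride spreads one warp's accesses across many lines.

```cuda
// Coalesced: thread k of a warp reads element k, so the warp's 32 loads fall
// into a small number of contiguous cache lines.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read elements `stride` apart, so one warp's
// loads can require up to 32 separate cache lines.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```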
Shared Memory / Local Data Store
Each SM has a small, explicitly managed on-chip memory that:
- Has much lower latency than global memory
- Is shared among all threads in a block
- Is programmer-controlled (you decide what to store there)
- Is often banked, with possible bank conflicts if accessed poorly
Typical HPC uses:
- Staging tiles of matrices or subarrays to enable data reuse
- Implementing small, fast scratchpads for reductions, transposes, and stencils
- Minimizing repeated global memory reads for the same data
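A classic instance of this tiling pattern is a matrix transpose staged through shared memory. The sketch below follows the widely used tile-plus-padding scheme, assuming 32x32 tiles and 32x32 thread blocks; the padding column avoids bank conflicts when the tile is read back column-wise.

```cuda
#define TILE 32   // tile edge; launch with 32x32 thread blocks

// Sketch: stage a tile in shared memory so that both the global-memory read
// and the global-memory write are coalesced. `in` has height rows, width columns.
__global__ void transposeTiled(const float* in, float* out,
                               int width, int height) {
    __shared__ float tile[TILE][TILE + 1];    // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();   // the whole tile must be loaded before it is reused

    // Write the transposed tile; note the swapped block indices.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```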
Registers
- Fastest storage level per thread
- Allocated from a large register file on each SM
- Not visible directly to other threads
- Register usage per thread affects:
- How many threads/blocks can be resident on an SM
- Overall occupancy and ability to hide latency
If a kernel uses too many registers, the compiler may:
- Spill values to local memory (which resides in global memory and is slow)
- Limit the number of concurrent warps per SM
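On NVIDIA hardware this can be inspected and influenced directly: compiling with nvcc -Xptxas -v reports registers per thread, --maxrregcount caps them for a whole compilation unit, and __launch_bounds__ gives the compiler a per-kernel hint, as in the sketch below (hypothetical kernel, illustrative bounds).

```cuda
// Sketch: __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) tells the
// compiler this kernel is launched with at most 256 threads per block and
// that at least 2 blocks should fit on an SM, so it limits register usage
// per thread accordingly (possibly spilling if necessary).
__global__ void __launch_bounds__(256, 2)
boundedKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i] + 1.0f;
}
```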
Constant and Texture Memories (Vendor-Specific Names)
Most GPUs provide specialized read-only memories:
- Constant memory:
- Cached and optimized for broadcast (many threads reading the same value)
- Good for small parameter tables or constants used by all threads
- Texture / read-only data caches:
- Optimized for particular access patterns (e.g., 2D spatial locality)
- Might offer hardware interpolation and other features (more common in graphics, sometimes useful in HPC)
Using these appropriately can reduce pressure on global memory.
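A typical use of constant memory is a small coefficient table read by every thread, as in the sketch below (hypothetical coeffs table and polynomial kernel, CUDA syntax); because all threads of a warp read the same entry, the constant cache broadcasts it.

```cuda
#include <cuda_runtime.h>

// Small read-only table placed in constant memory; broadcast-friendly because
// every thread reads the same entries.
__constant__ float coeffs[4];

// Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x^2 + coeffs[3]*x^3 per element.
__global__ void applyPolynomial(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = coeffs[0] + x * (coeffs[1] + x * (coeffs[2] + x * coeffs[3]));
    }
}

// Host side: fill the constant-memory table before launching the kernel.
void setCoefficients(const float host_coeffs[4]) {
    cudaMemcpyToSymbol(coeffs, host_coeffs, 4 * sizeof(float));
}
```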
Host–Device Relationship and Interconnect
Discrete GPUs are typically attached to a CPU host via an interconnect such as:
- PCIe (various generations)
- NVLink, Infinity Fabric, or other high-bandwidth proprietary links
Architecturally:
- The CPU (host) and GPU (device) have separate memory spaces
- Data must be moved between host memory and device global memory
- The interconnect has much lower bandwidth and much higher latency than on-device memory
For HPC:
- Host–device transfers are expensive relative to on-device computation
- Algorithms are often designed to:
- Transfer data in large chunks
- Perform as much computation as possible per transfer
- Avoid unnecessary round trips between CPU and GPU
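Put together, a common host-side pattern looks like the sketch below (CUDA runtime API, hypothetical compute kernel): one large transfer in, many kernel launches on the resident data, one transfer back.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Placeholder for the real on-device work.
__global__ void compute(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 0.5f * data[i] + 1.0f;
}

// Sketch of the pattern above: copy once, compute many times, copy back once.
void processOnGpu(std::vector<float>& host_data) {
    int n        = static_cast<int>(host_data.size());
    size_t bytes = n * sizeof(float);

    float* d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, host_data.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    for (int step = 0; step < 100; ++step)          // many kernels per transfer
        compute<<<blocks, threads>>>(d_data, n);

    cudaMemcpy(host_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```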
Some modern systems support features like:
- Unified virtual addressing (UVA)
- On-demand page migration
- GPU-direct communication with network/storage
But physically, bandwidth and latency constraints still matter for performance.
Latency Hiding and Occupancy
GPUs tolerate high latency (especially memory latency) by running many threads concurrently:
- When one warp stalls (e.g., waiting for memory), the SM’s scheduler:
- Quickly switches to another ready warp
- Overlaps memory latency with computation from other warps
- No expensive OS-level context switches are needed; state is kept in hardware
Occupancy:
- The ratio of active warps per SM to the maximum possible
- Affected by:
- Registers per thread
- Shared memory per block
- Threads per block
- Higher occupancy generally improves the GPU’s ability to hide latency, but:
- It is not the only performance factor
- Very high occupancy with poor memory access patterns can still perform badly
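For CUDA kernels, the runtime can estimate these limits directly. The sketch below asks how many blocks of a hypothetical saxpy kernel fit on one SM for a given block size and converts the answer into an occupancy figure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // device 0 assumed

    int threadsPerBlock = 256;
    int blocksPerSM     = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, saxpy, threadsPerBlock, /*dynamicSmemBytes=*/0);

    int activeWarps = blocksPerSM * threadsPerBlock / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Resident blocks per SM: %d\n", blocksPerSM);
    printf("Occupancy estimate:     %.0f%%\n",
           100.0 * activeWarps / maxWarps);
    return 0;
}
```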
Architecturally, GPU design assumes that:
- There are many independent threads
- Threads spend much of their time waiting on memory operations
- The scheduler can always find work to do while others wait
Specialized Functional Units
Besides general-purpose ALUs, GPUs often include specialized hardware units:
- Tensor / matrix cores:
- Accelerate small matrix-multiply–accumulate operations
- Often support mixed precision (e.g., FP16, BF16, TF32, INT8)
- Critical for AI workloads and increasingly leveraged in HPC (e.g., mixed-precision linear algebra)
- Special function units:
- Implement transcendental functions (sin, cos, exp, log, etc.)
- Trade some accuracy for throughput (e.g., fast approximate intrinsics versus slower, fully accurate library routines)
- Atomic units:
- Support atomic operations on integers and floating-point data in memory
At the architectural level, these units:
- Operate alongside standard ALUs within each SM
- Are fed via specific instruction types generated by compilers or intrinsics
- Have their own throughput/latency characteristics
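As a small example of the atomic units in action, the hypothetical CUDA kernel below accumulates a global sum with atomicAdd; a tuned reduction would first combine values within each block (in shared memory or with warp shuffles) and issue far fewer atomics.

```cuda
// Sketch: every thread contributes its element to a single accumulator.
// The hardware atomic unit serializes conflicting updates correctly.
__global__ void sumAll(const float* in, float* result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(result, in[i]);   // FP32 atomic add in global memory
}
```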
Multi-GPU and Node-Level GPU Topologies
Within a single node, there may be multiple GPUs connected via:
- PCIe fabric (often via switches)
- Dedicated GPU–GPU links (NVLink, Infinity Fabric, etc.)
Architecturally, this affects:
- Peer-to-peer (P2P) capabilities between GPUs
- Bandwidth for exchanging data between GPUs versus via the CPU
- How multi-GPU algorithms are structured (e.g., direct GPU–GPU communication vs. staging via host)
The physical layout (topology) matters for:
- How workloads are mapped to GPUs
- How data is partitioned across devices
- Achievable scaling efficiency inside a node
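On NVIDIA systems the topology is visible to software through peer-access queries. The sketch below (devices 0 and 1 assumed to exist) checks whether GPU 0 can address GPU 1's memory directly and enables the direct path if so.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
    printf("GPU 0 -> GPU 1 peer access possible: %s\n", canAccess ? "yes" : "no");

    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0
        // Device 0 can now dereference allocations made on device 1, and
        // cudaMemcpyPeer / cudaMemcpyPeerAsync can use the direct GPU-GPU link.
    }
    return 0;
}
```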
Precision and Throughput Trade-offs
GPU hardware is designed with different arithmetic pipelines:
- FP64 (double precision): essential for many traditional HPC applications
- FP32 (single precision): higher throughput, widely used
- Lower precisions (FP16, BF16, TF32, INT8): extremely high throughput, especially on tensor cores
Architecturally:
- FP64 units may be fewer or slower than FP32 units
- Tensor cores often operate on lower-precision data but accumulate in higher precision
This leads to common patterns in HPC:
- Mixed-precision algorithms where most computation uses fast low-precision units
- Occasional high-precision refinement steps using FP64
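A minimal illustration of the idea is a dot product whose elementwise products run in FP32 while accumulation is carried in FP64; the hypothetical CUDA kernel below writes one high-precision partial sum per thread, to be reduced afterwards.

```cuda
// Sketch: bulk arithmetic in FP32, accuracy-critical accumulation in FP64.
// `partial` must hold one entry per launched thread; the partial sums are
// reduced on the host or in a follow-up kernel.
__global__ void dotMixed(const float* x, const float* y,
                         double* partial, int n) {
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    double acc = 0.0;                                   // FP64 accumulator
    for (int k = i; k < n; k += stride)
        acc += static_cast<double>(x[k] * y[k]);        // FP32 multiply, FP64 add

    partial[i] = acc;
}
```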
Understanding the GPU’s precision mix is important when selecting hardware for specific scientific workloads.
Summary
From an architectural perspective, GPUs:
- Consist of many SMs packed with simple cores
- Execute threads in lockstep groups (warps/wavefronts) using SIMT
- Rely on a deep memory hierarchy with fast on-chip memories and very high-bandwidth global memory
- Hide latency by running large numbers of concurrent threads
- Communicate with the CPU and other GPUs via relatively slower interconnects
These features make GPUs highly effective for data-parallel, compute-intensive workloads that can exploit massive concurrency and structured memory access patterns, which is central to their role in HPC.