Overview: Why GPU Memory Hierarchy Matters
GPU performance depends heavily on how you move data, not just how many cores you use. On modern GPUs, the difference between using the “right” memory and the “wrong” memory can easily be a 10–100× performance factor.
This chapter focuses on:
- The main types of memory on a GPU
- Their key properties: latency, bandwidth, capacity, and scope
- How the hierarchy influences kernel design and data layout
- Typical usage patterns and common pitfalls
We’ll describe things using CUDA-like terminology, but the concepts apply similarly to other GPU models (OpenCL, HIP, etc.).
Big Picture: GPU Memory Stack
Conceptually, you can think of GPU memory as layers, from fastest/smallest/closest to the compute units, to slowest/largest/farthest:
- Registers (per thread)
- Local memory (per thread, but in device DRAM)
- Shared memory / L1 (per thread block / workgroup)
- L2 cache (shared by all SMs / compute units)
- Global memory (device DRAM)
- Constant & texture memory (global but cached/specialized)
- Host memory & storage (CPU RAM, disks – accessed over PCIe or NVLink)
We’ll now go through the important ones in order of “closeness” to computation.
Registers
Registers are the smallest and fastest storage on the GPU.
- Scope: One thread
- Lifetime: Duration of the thread / kernel execution (for that variable)
- Location: On-chip, in the streaming multiprocessor (SM) / compute unit
- Latency: A few cycles (very fast)
- Capacity: Limited number per SM, divided among active threads
Registers hold:
- Thread-local variables (e.g., loop counters, temporary values)
- Inputs brought from memory and used in arithmetic
For performance, you want:
- Frequently used values to stay in registers.
- Not to use “too many” registers per thread, or you reduce how many threads can run at once (this reduces occupancy and can hurt performance).
When a kernel uses more registers than the hardware allows per thread, the compiler “spills” some values to local memory, which is much slower.
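As a rough illustration, here is a minimal CUDA sketch (the kernel name, the saxpy-style computation, and the 256-thread bound are assumptions for the example). The scalar temporaries live in registers; the __launch_bounds__ qualifier helps the compiler budget registers per thread, and compiling with nvcc --ptxas-options=-v reports the actual per-thread register count.

```cuda
// Minimal sketch: scalar temporaries like a, b, and acc normally live in
// registers. __launch_bounds__(256) tells the compiler the maximum block
// size so it can budget registers per thread.
__global__ void __launch_bounds__(256) saxpy_like(const float* x,
                                                  const float* y,
                                                  float* out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index kept in a register
    if (i < n) {
        float a = x[i];          // loaded once, then reused from a register
        float b = y[i];
        float acc = s * a + b;   // arithmetic on register operands
        out[i] = acc;
    }
}
```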
Local Memory (Per-Thread but in Global DRAM)
Despite the name, local memory in CUDA (and similar in other APIs) is not on-chip; it lives in device global DRAM but is logically private to a thread.
- Scope: One thread (private address space)
- Lifetime: Duration of the thread
- Location: Device DRAM, usually cached by L1/L2
- Latency: Similar to global memory (hundreds of cycles)
- Capacity: Limited mainly by global memory size
Local memory is used for:
- Register spills (handled by the compiler)
- Large thread-local arrays that can’t fit in registers
- Variables that require indexing with non-constant indices (where registers can’t be used efficiently)
You normally don’t intend to use local memory; it appears when:
- You declare large local arrays inside kernels.
- You have complex indexing that forces spills.
- The compiler runs out of registers.
In performance tuning, one of the goals is to reduce local memory usage by:
- Simplifying kernels.
- Splitting large kernels into smaller ones.
- Refactoring data structures.
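The hedged sketch below (the kernel name and the 256-bin histogram are illustrative) shows the typical way local memory appears in practice: a per-thread array indexed with values that are unknown at compile time, which the compiler generally cannot keep in registers.

```cuda
// Sketch: the per-thread array `hist` is indexed with a runtime value, so the
// compiler typically places it in local memory (device DRAM, cached by L1/L2)
// rather than in registers. --ptxas-options=-v reports local-memory usage.
__global__ void per_thread_histogram(const unsigned char* data, int n, int* out)
{
    int hist[256];                           // likely placed in local memory
    for (int b = 0; b < 256; ++b) hist[b] = 0;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = i; k < n; k += gridDim.x * blockDim.x)
        hist[data[k]]++;                     // dynamic index -> not registers

    for (int b = 0; b < 256; ++b)            // combine per-thread counts
        atomicAdd(&out[b], hist[b]);
}
```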
Shared Memory (On-Chip, Per Block)
Shared memory is explicitly managed on-chip memory accessible by all threads in a block (CUDA) or workgroup (OpenCL).
- Scope: Threads within a block / workgroup
- Lifetime: While the block is running
- Location: On-chip SRAM, near the compute units
- Latency: Tens of cycles (much faster than global memory)
- Capacity: Limited (e.g., tens of KB per SM)
Key properties:
- User-managed: you explicitly read from global memory into shared memory, operate on it, then write results back.
- Enables data reuse within a block, reducing global memory traffic.
- Supports synchronization via barriers (e.g., __syncthreads() in CUDA).
Typical uses:
- Tiling (blocking) for matrix operations:
- Load a tile of a matrix from global memory into shared memory.
- Have all threads in the block reuse that tile multiple times.
- Local caching of data needed by multiple threads.
- Implementing reductions, scans, and other collective operations over the threads in a block.
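To make the shared-memory workflow concrete, here is a minimal block-level sum reduction sketch (the kernel name and a block size of 256 are assumptions); it follows the load → synchronize → reuse → write-back pattern described above.

```cuda
// Sketch: block-level sum reduction staging values in shared memory so that
// threads in the same block can combine each other's results.
// Assumes the kernel is launched with a power-of-two block size of 256.
__global__ void block_sum(const float* in, float* block_sums, int n)
{
    __shared__ float buf[256];               // on-chip, visible to the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // all loads into shared are done

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)                    // one partial result per block
        block_sums[blockIdx.x] = buf[0];
}
```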
Bank Conflicts
Shared memory is divided into banks that can be accessed in parallel.
- If different threads in a warp access different addresses in the same bank, requests serialize → bank conflict → lower performance.
- If each thread accesses different banks (or all access the same address, in architectures that support broadcast), performance is good.
You don’t need the full hardware details here, but conceptually:
- Stride patterns like data[threadIdx.x * stride] can cause conflicts, depending on stride.
- Padding shared memory arrays (e.g., using N+1 instead of N) can avoid conflicts.
Shared memory layout and access patterns are a major part of GPU performance optimization.
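The classic illustration of both tiling and padding is a shared-memory matrix transpose. The sketch below assumes a 32×32 tile and a 32×32 thread block; the extra "+1" column is the padding trick mentioned above.

```cuda
// Sketch: a 32x32 shared-memory tile used for a matrix transpose.
// Without the "+1" padding, the column accesses in the second phase would
// land in the same bank and serialize; with it they do not.
// Assumes the kernel is launched with dim3 block(TILE, TILE).
#define TILE 32

__global__ void transpose(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;                // transposed block
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y];  // conflict-free
}
```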
L1 and L2 Caches
Modern GPUs have hardware-managed caches similar in spirit to CPU caches.
- L1 cache:
- Often combined with shared memory (configurable split between them).
- Per-SM cache.
- Caches data from global memory (and sometimes local memory).
- L2 cache:
- Shared by all SMs.
- Larger but slower than L1.
- Central point for global memory traffic (including accesses from all cores and often copies to/from host memory).
Key roles:
- Hide some of the cost of global memory accesses.
- Reduce bandwidth by avoiding repeated DRAM accesses for the same data.
- Support read-only or specialized caches for constant/texture memory (more below).
From a programmer’s perspective:
- You don’t control caches directly (unlike shared memory).
- But you can design access patterns that work with caches:
- Spatial locality: consecutive threads access consecutive addresses.
- Temporal locality: data likely to be reused soon is accessed “close together” in time.
Some APIs offer hints (e.g., specifying read-only data to enable better caching), but the core behavior is automatic.
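As an example of such a hint, the sketch below (an arbitrary element-wise kernel) marks read-only inputs as const ... __restrict__, which on architectures with a read-only data cache allows the compiler to route those loads through it.

```cuda
// Sketch: `const ... __restrict__` on kernel parameters is a hint that the
// data is read-only and not aliased, so the compiler may use the read-only
// data cache path for those loads.
__global__ void scale(const float* __restrict__ in,   // read-only hint
                      float* __restrict__ out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}
```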
Global Memory (Device DRAM)
Global memory is the main memory on the GPU device.
- Scope: All threads, all blocks, all kernels on a device
- Lifetime: As long as you allocate it (e.g., via cudaMalloc) or until context destruction
- Location: Off-chip GDDR/HBM memory
- Latency: High (hundreds of cycles)
- Bandwidth: Very high (hundreds of GB/s), but only if used efficiently
- Capacity: GB-scale
Global memory is where:
- Large arrays and data structures for your problem live.
- Inputs/outputs are stored between kernel launches.
- Host–device transfers ultimately read and write data.
Performance considerations focus on memory coalescing and access patterns.
Memory Coalescing
Threads are grouped into warps (or wavefronts). When threads in a warp access global memory:
- If threads access contiguous addresses (e.g., thread 0 reads A[0], thread 1 reads A[1], …), accesses can be combined into fewer large transactions → high effective bandwidth.
- If threads access scattered or irregular addresses, many small transactions are needed → low bandwidth, increased latency.
Coalescing rules differ by architecture, but a few general guidelines:
- Arrange data such that the index mapped to threadIdx.x (inside a warp) corresponds to the fastest-varying dimension in memory.
- Prefer structure of arrays (SoA) over array of structures (AoS) when many threads access individual fields:
  - SoA: x[i], y[i], z[i] in separate arrays → contiguous access.
  - AoS: struct { float x; float y; float z; } p[N]; → threads may access strided patterns.
- Avoid patterns where each thread in a warp accesses memory far apart (e.g., A[threadId * big_stride]).
Global memory is slow in latency terms, so GPU hardware hides latency by running many threads concurrently. Coalesced, predictable access maximizes the use of available bandwidth.
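The following sketch contrasts the two layouts for the same logical operation (the type and kernel names are illustrative): the SoA version produces coalesced accesses, while the AoS version produces a strided pattern.

```cuda
// Sketch: AoS vs. SoA for the same per-element update.
struct ParticleAoS { float x, y, z; };

// AoS: consecutive threads touch p[i].x at addresses 12 bytes apart -> strided.
__global__ void scale_x_aos(ParticleAoS* p, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;
}

// SoA: consecutive threads touch x[i] at consecutive addresses -> coalesced.
__global__ void scale_x_soa(float* x, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
```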
Constant Memory
Constant memory is a small, read-only memory region optimized for broadcasting the same value to many threads.
- Scope: All threads on the device
- Lifetime: As long as the constant is defined / bound
- Location: Device memory, with a small on-chip constant cache
- Usage: Read-only from device kernels; writes from host
Performance characteristics:
- If all threads in a warp read the same constant address:
- Value can be broadcast efficiently → very fast.
- If threads read different constant addresses:
- Accesses serialize within a warp → slower (similar to regular global loads).
Typical uses:
- Small parameter sets: physical constants, coefficients, lookup tables.
- Read-only configuration data shared across many threads.
Your code usually declares __constant__ arrays/variables (CUDA) and initializes them from host code.
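A minimal sketch of that pattern (the array name, its size of 16, and the filter kernel are assumptions) might look like this; cudaMemcpyToSymbol is the host-side call that fills the constant region.

```cuda
// Sketch: a small coefficient table in constant memory. When every thread in
// a warp reads the same coeffs[j], the value is broadcast from the constant
// cache.
__constant__ float coeffs[16];

__global__ void apply_filter(const float* in, float* out, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + k <= n) {                        // assumes k <= 16
        float acc = 0.0f;
        for (int j = 0; j < k; ++j)
            acc += coeffs[j] * in[i + j];    // same coeffs[j] for the warp
        out[i] = acc;
    }
}

// Host side (error checking omitted):
//   float h_coeffs[16] = { /* values */ };
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
```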
Texture Memory and Read-Only Caches
Texture memory (or more generally, read-only data caches) is another specialized path to device memory.
- Scope: All threads
- Lifetime: As long as texture/buffer is bound
- Location: Device memory, with dedicated caches and units
- Features:
- Caching optimized for spatial locality (2D/3D).
- Optional hardware interpolation and addressing modes (wrap, clamp, etc.).
Typical uses:
- Sampling fields (e.g., images, grids, volumes) where nearby threads access nearby coordinates.
- Read-only tables with irregular access patterns, where the texture cache can help.
Modern CUDA also has a generic read-only data cache accessible via, e.g., __ldg() on some architectures, for data that won’t be modified by the kernel. This can route reads through a path optimized for read-only caching.
The main decisions:
- Use texture/read-only caches for data:
- That is not modified by the kernel.
- That has good spatial locality, especially in 2D/3D.
- Use regular global loads for everything else, focusing on coalescing.
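For the read-only cache path, a hedged sketch (names are illustrative) using __ldg() for an irregularly indexed, never-written table could look like this:

```cuda
// Sketch: __ldg() explicitly reads through the read-only data cache
// (available on compute capability 3.5 and later). The table is never
// written by the kernel, so the read-only path is safe here.
__global__ void gather(const float* __restrict__ table,
                       const int* __restrict__ idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&table[idx[i]]);      // irregular, read-only access
}
```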
Host Memory and Interconnect (PCIe / NVLink)
From the GPU’s perspective, host RAM is another, much slower level in the hierarchy, reached over an interconnect:
- PCIe:
- Common in most systems.
- High latency, limited bandwidth compared to GPU DRAM.
- NVLink / similar high-speed links:
- Higher bandwidth and lower latency than PCIe.
- Used in some HPC systems to connect GPUs to CPUs and to each other.
Data movement between host memory and device global memory is expensive relative to on-GPU accesses.
Key implications:
- Minimize host–device transfers inside performance-critical loops.
- Move data once, keep it on the GPU for as many kernels as possible, then transfer results back.
- In multi-GPU setups, understand when data must cross inter-GPU links.
Host memory is typically managed by CPU code; in this chapter, the key takeaway is that from a GPU kernel’s point of view, it is much further away than global memory and usually not directly visible.
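A hedged host-side sketch of the "transfer once, keep data resident" pattern (the function, sizes, and the commented-out kernel are placeholders) is shown below.

```cuda
// Sketch: one upload, many kernels on device-resident data, one download.
// Error checking omitted for brevity.
#include <cuda_runtime.h>

void run_pipeline(const float* h_in, float* h_out, int n, int iters)
{
    float* d_buf = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_buf, bytes);

    // One host-to-device transfer up front...
    cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);

    // ...many kernels operating on data that stays in device global memory...
    for (int it = 0; it < iters; ++it) {
        // step_kernel<<<(n + 255) / 256, 256>>>(d_buf, n);  // hypothetical kernel
    }

    // ...and one device-to-host transfer at the end.
    cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}
```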
Putting It Together: Design Patterns for the Hierarchy
When you write GPU kernels, an effective mental model is:
- Put what you can in registers (per-thread variables).
- Use shared memory to cache data reused by threads within a block:
- Load tiles/chunks from global → shared.
- Do computations that reuse these tiles.
- Write results back to global only when needed.
- Access global memory coalesced:
- Align data layout with thread layout.
- Prefer contiguous, predictable access patterns.
- Use constant / texture / read-only caches for:
- Small read-only parameters.
- Fields or tables with spatial locality or irregular patterns.
- Minimize host–device data movement:
- Batch kernels and keep data resident on the GPU.
In practice, high-performance GPU codes:
- Carefully map data structures to this hierarchy.
- Use profiling tools to see where memory bandwidth and latency are limiting performance.
- Iteratively refine memory access patterns and hierarchy usage.
Understanding this hierarchy is essential before diving into programming models like CUDA or OpenACC, where you’ll see explicit mechanisms for controlling how and where data lives on the GPU.