
10.3 Memory hierarchy on GPUs

Overview of GPU Memory Hierarchy

GPUs achieve high performance by running many threads in parallel, but the memory system strongly limits how fast those threads can actually access data. The memory hierarchy on GPUs is a set of different storage types with very different sizes, latencies, and access patterns. For practical GPU programming, especially with CUDA or OpenACC, you must understand where your data lives and how different memories interact.

In this chapter, we look at the typical memory hierarchy of modern discrete GPUs, using CUDA terminology for concreteness. The exact names differ between APIs and vendors, but the structure and trade-offs are very similar.

Key idea: GPU performance is often limited by memory, not arithmetic. Understanding the GPU memory hierarchy is essential for performance.

Global Memory

Global memory is the main large memory on the GPU. It resides in off-chip DRAM attached to the device. All threads on the GPU can access global memory, and data stored there persists across kernel launches until the allocation is explicitly freed.

Global memory has three key characteristics. It is large, usually many gigabytes. It is relatively slow, with high latency, so an access can take hundreds of clock cycles. It is accessed through a hierarchy of caches and is optimized for regular, coalesced accesses by groups of threads.

On CUDA-capable GPUs, data you allocate with functions such as cudaMalloc usually lives in global memory. Kernels launched on the device typically read input arrays and write results through global memory pointers passed as kernel arguments.
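As a minimal, hypothetical sketch of this pattern (the kernel and variable names are invented, and error checking is omitted), the following CUDA code allocates an array in global memory, copies input from the host, and lets each thread update one element through the pointer passed to the kernel:

#include <cuda_runtime.h>

// Each thread reads and writes one element of the array in global memory.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;                                      // device pointer into global memory
    cudaMalloc(&d, n * sizeof(float));             // allocation lives in off-chip DRAM
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // kernel accesses the data via the pointer

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    delete[] h;
    return 0;
}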

Because global memory is slow compared to on-chip resources, programmers try to reduce the number of accesses, reuse data in faster memories when possible, and ensure that when threads do access global memory, they do so in a way the hardware can combine into efficient transactions.

Local Memory

In GPU terminology, local memory is a private memory space for each thread. It is not the same as stack-allocated local variables on a CPU. The compiler uses local memory for thread-private data that does not fit in registers, or that requires addressing patterns registers cannot support.

Although the name suggests something physically close to the thread, local memory usually resides in the same off-chip DRAM as global memory. It therefore has latency and bandwidth characteristics similar to global memory; it is simply a different address space with per-thread scoping.

Common reasons data ends up in local memory are large arrays declared in device functions, excessive register usage that causes register spilling, and complex indexing that prevents allocation in registers. When this happens, performance can drop significantly, because each spill and fill is an additional access to off-chip memory.

Understanding when the compiler uses local memory often requires inspecting compiler reports or using profiling tools that show register usage and local memory traffic.
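As an illustrative, hypothetical sketch (kernel and file names invented; the exact report format varies with the nvcc version and target architecture), a large per-thread array with a runtime index is a typical local-memory candidate, and ptxas can be asked to report where it ended up:

// A per-thread array indexed with a runtime value usually cannot live in
// registers, so the compiler places it in (off-chip) local memory.
__global__ void pick(const float *in, float *out, int n, int sel) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float scratch[256];                      // thread-private array
    for (int k = 0; k < 256; ++k)
        scratch[k] = in[i] * k;
    out[i] = scratch[sel & 255];             // runtime index forces addressable storage
}

// Inspect the compiler's decision:
//   nvcc -O3 -Xptxas -v pick.cu
// ptxas then prints, per kernel, the registers used, the local-memory
// (stack frame) bytes, and any spill stores and spill loads.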

Shared Memory

Shared memory is an on-chip memory shared by all threads in a block. It combines features of a fast scratchpad and a user-controlled cache. Threads in a block can cooperatively load data from global memory into shared memory, then reuse it multiple times at much lower latency.

Shared memory is relatively small, typically on the order of tens of kilobytes per streaming multiprocessor, and is divided among the blocks resident on that multiprocessor. It is much faster than global memory. Latency is a small number of cycles if there are no bank conflicts and the access pattern is friendly.

Because threads in a block can see the same shared memory, it is widely used for tiling algorithms. A block loads a tile of a larger data structure into shared memory, performs a computation such as a small matrix multiply or a local stencil, then writes results back to global memory. This reduces global memory traffic and exploits data reuse.
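A minimal sketch of this tiling pattern (hypothetical kernel; CUDA assumed, boundary cells padded with zeros): a 1D three-point stencil in which every input element loaded into shared memory is reused by up to three threads of the block:

#define BLOCK 256

__global__ void stencil3(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2];                 // block-sized tile plus one halo cell per side
    int gi = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int li = threadIdx.x + 1;                         // local index, shifted past the left halo

    tile[li] = (gi < n) ? in[gi] : 0.0f;              // cooperative, coalesced load of the tile
    if (threadIdx.x == 0)
        tile[0] = (gi > 0) ? in[gi - 1] : 0.0f;       // left halo cell
    if (threadIdx.x == blockDim.x - 1)
        tile[li + 1] = (gi + 1 < n) ? in[gi + 1] : 0.0f;   // right halo cell
    __syncthreads();                                  // the tile must be complete before use

    if (gi < n)
        out[gi] = (tile[li - 1] + tile[li] + tile[li + 1]) / 3.0f;   // reuse from shared memory
}

// Launched, for example, as stencil3<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);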

Shared memory is physically organized into banks, and each bank can typically service one access per cycle. If multiple threads in a warp access different addresses that map to the same bank, accesses serialize, which is called a bank conflict. Proper indexing and data layout can avoid or reduce conflicts.
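A classic illustration, assuming 32 banks of 4-byte words and a square matrix whose side is a multiple of 32, launched with 32 by 32 thread blocks: a shared-memory matrix transpose, where one padding column turns a 32-way conflicted column read into a conflict-free one (a hypothetical sketch, not taken from any specific library):

__global__ void transpose32(const float *in, float *out, int width) {
    // Without the "+ 1" pad, the column read below (tile[threadIdx.x][threadIdx.y])
    // would hit the same bank for all 32 threads of a warp: a 32-way bank conflict.
    __shared__ float tile[32][32 + 1];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];       // row-wise write: conflict-free

    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;                        // transposed block position
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];      // column read, spread over banks
}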

Shared memory is explicitly managed in CUDA with declarations such as __shared__ and explicit reads and writes. In OpenACC, compilers may automatically map some data to fast memory, but advanced users still think in terms of shared-memory-like spaces.

Registers and Register File

Registers are the fastest storage available on the GPU. Each thread has its own set of registers, and register contents are not visible to other threads. The register file resides on-chip and provides effectively single-cycle access for most arithmetic operations.

The number of registers per thread is limited. The hardware provides a fixed number of registers per streaming multiprocessor, and these must be divided among all active threads. If each thread uses many registers, fewer threads can be resident at once, which can reduce occupancy.

Occupancy is the fraction of the hardware's maximum number of resident threads (or warps) that are actually active on a streaming multiprocessor; high occupancy helps hide latency. If occupancy becomes too low, the GPU may not have enough ready threads to switch to while one set waits on memory, which hurts overall performance.

If a kernel’s register demand exceeds what is available per thread, the compiler spills register values into local memory. This reintroduces high-latency DRAM accesses and usually degrades performance. There is therefore a trade-off between rich per-thread state and the ability to run many threads concurrently.
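In CUDA, this trade-off can be steered explicitly; the sketch below (hypothetical kernel) uses __launch_bounds__ to tell the compiler the intended block size and a minimum number of resident blocks, which caps register usage per thread, possibly at the price of spills. A blunter alternative is the nvcc flag --maxrregcount, which caps registers for every kernel in the compilation unit.

// Compile for blocks of 256 threads with at least two blocks resident per
// multiprocessor. The compiler limits registers per thread accordingly and
// spills to local memory if the kernel needs more.
__global__ void __launch_bounds__(256, 2)
saxpy_chain(const float *x, const float *y, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = x[i], b = y[i], acc = 0.0f;
        for (int k = 1; k <= 8; ++k)          // per-thread temporaries held in registers
            acc += a * k + b / k;
        out[i] = acc;
    }
}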

Constant Memory

Constant memory is a small, read-only memory space on the GPU intended for values that do not change over the lifetime of a kernel, such as physical constants or configuration parameters. It lives in off-chip DRAM but is cached in a specialized on-chip constant cache.

The constant cache is optimized for broadcast patterns. When all threads in a warp read the same constant address, the hardware can fetch the value once and broadcast it to all threads, which is very efficient. When threads in a warp read different constant addresses, performance can degrade.

On CUDA GPUs, constant memory is limited in size, on the order of tens of kilobytes of visible constant space per module, with an associated constant cache. It is best used for small parameter tables, coefficients, or read-only configuration that many threads need.
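A small hedged sketch of this use case in CUDA (names invented): a coefficient table placed in constant memory, filled once from the host with cudaMemcpyToSymbol, and read by every thread in a broadcast-friendly way:

__constant__ float c_coeffs[16];                 // small read-only table in constant memory

__global__ void apply_poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f, p = 1.0f;
        for (int k = 0; k < 16; ++k) {
            acc += c_coeffs[k] * p;              // every thread reads the same address:
            p *= x[i];                           // the constant cache broadcasts it
        }
        y[i] = acc;
    }
}

// Host side, before the launch:
//   float h_coeffs[16] = { /* ... */ };
//   cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));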

Because constant memory is cached, repeated reads of the same location can be very fast after the first access. However, it is not a general-purpose cache for arbitrary data structures.

Texture and Read-Only Data Caches

Texture memory is another read-only memory space, originally designed for graphics workloads. It adds features such as filtering and addressing modes, but for compute workloads the main benefit is a separate cache hierarchy tuned for certain access patterns.

In modern CUDA architectures, texture and read-only data accesses often share or overlap with a generic read-only data cache. Programmers can provide hints that some global memory accesses are read-only and may exhibit spatial locality, which lets the hardware use specialized caching policies.
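A hedged sketch of such hints (hypothetical kernel; the exact caching behavior depends on the architecture): declaring input pointers const and __restrict__ tells the compiler the data is read-only and not aliased for the lifetime of the kernel, and the __ldg() intrinsic requests a load through the read-only data path explicitly:

// 'in' and 'idx' are promised to be read-only and non-aliasing, so their loads
// are eligible for the read-only data cache even though the gather is irregular.
__global__ void gather(const float * __restrict__ in,
                       const int   * __restrict__ idx,
                       float       * __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[idx[i]]);     // explicit read-only load of the gathered value
}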

Texture or read-only caches can improve performance when many threads access nearby locations in arrays, or when 2D or 3D spatial locality exists. They can also benefit irregular access patterns where coalescing is difficult but some locality remains.

The size of these caches is limited, and replacement is automatic. This distinguishes them from shared memory, which you manage explicitly. For small to medium-sized lookup tables and access patterns with reuse across threads, read-only caches can significantly reduce global memory traffic.

L1 and L2 Caches

Modern GPUs include general-purpose caches, usually a device-wide L2 cache and one or more L1-like caches close to the streaming multiprocessors. Their exact configuration and interaction with shared memory depend on the architecture, but the general idea is similar to CPU caches, although policies are tuned for massive parallelism.

The L2 cache sits between global memory and the rest of the chip. It is shared by all streaming multiprocessors and helps reduce repeated DRAM accesses for data reused across blocks or kernels, or for accesses that exhibit temporal locality at device scope.

Each streaming multiprocessor typically has an L1-like cache that serves as the first level for global memory accesses. Some architectures let you configure a trade-off between shared memory size and L1 cache capacity, reflecting the balance between programmer-managed and hardware-managed fast storage.
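On architectures that expose this choice, CUDA offers cache-configuration hints; a sketch (my_kernel is a placeholder, and on recent architectures the same idea is expressed as a shared-memory carveout attribute instead):

// Device-wide preference: give shared memory the larger share of the split.
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

// Per-kernel preference: this kernel benefits more from a larger L1 cache.
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

These are hints only: architectures with a fixed or unified L1 and shared memory design are free to ignore them.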

GPU caches aim to serve coalesced accesses from warps. When threads in a warp read consecutive addresses, the hardware combines these into a few cache line fetches. If threads access scattered addresses, more cache lines are loaded, and caching becomes less effective.

Caches are mostly transparent. You do not allocate data directly in them, but you see their effect in performance. Profiling tools can report cache hit rates and help determine whether a kernel is bandwidth-limited at the DRAM level or already limited by cache behavior.

Memory Coalescing and Access Granularity

A critical feature of GPU memory systems is coalescing. Threads are grouped into warps, and the hardware services memory requests at warp granularity. If threads in a warp access addresses that fall into a small number of contiguous memory segments, the hardware merges these into a few large transactions. This is called coalesced access.

If, instead, threads access widely separated locations, for example with a large or misaligned stride, the warp generates many separate memory transactions. This increases pressure on the memory system and often reduces effective bandwidth.

Coalescing interacts with the memory hierarchy because large coalesced transactions map naturally into cache line transfers and efficient DRAM bursts. The design of data structures, such as favoring structures of arrays over arrays of structures for some workloads, often aims at improving coalescing.
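A sketch of that contrast (hypothetical kernels; a 16-byte particle struct assumed): with an array of structures, a warp reading one field touches addresses 16 bytes apart, so each memory transaction carries mostly unused data, while the structure-of-arrays layout makes the same read contiguous and fully coalesced:

struct Particle { float x, y, z, mass; };        // 16 bytes per particle

// Array of structures: thread i reads p[i].x, so a warp touches 32 addresses
// spaced 16 bytes apart and spans four 128-byte segments instead of one.
__global__ void aos_x(const Particle *p, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].x;
}

// Structure of arrays: thread i reads x[i], so a warp touches 32 consecutive
// floats that fit in a single 128-byte segment: a fully coalesced access.
__global__ void soa_x(const float *x, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];
}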

For shared memory, a related concept is bank conflict behavior. While global memory and caches care about coalesced lines, shared memory cares about banking. Both phenomena shape how GPUs aggregate or serialize memory accesses.

Rule of thumb: Neighboring threads should access neighboring data whenever possible, so the hardware can coalesce accesses and use caches and DRAM bandwidth efficiently.

Memory Spaces and Visibility

Each type of GPU memory has its own visibility and lifetime properties. Registers and local memory are private to a thread. Shared memory is visible to all threads in a block. Global, constant, and texture memories are visible to all threads on the device, and often across kernels, as long as the data is allocated and not freed.

Synchronization operations such as thread-block barriers and device-wide memory fences are needed to coordinate accesses and ensure correct visibility when threads communicate through memory. The memory hierarchy interacts with these rules, because some caches may hold stale data if correct ordering and flushing are not enforced.
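A small sketch of block-level coordination (hypothetical kernel; assumes the array length is a multiple of the block size of 256): threads exchange values through shared memory, and the barrier both synchronizes execution and makes the preceding shared-memory writes visible to the whole block:

__global__ void reverse_within_block(float *data) {
    __shared__ float buf[256];                   // assumes blockDim.x == 256
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    buf[t] = data[i];                            // each thread publishes one value
    __syncthreads();                             // barrier: all writes above are now
                                                 // visible to every thread in the block
    data[i] = buf[blockDim.x - 1 - t];           // safe to read another thread's slot
}

Communication between blocks goes through global memory and typically needs __threadfence() plus atomics for ordering, or simply a separate kernel launch, because blocks cannot barrier with each other inside an ordinary kernel.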

Memory consistency models on GPUs continue to evolve, but from a programming perspective, the key is to follow the synchronization primitives defined by the programming model. Correctness should not rely on accidental cache behavior.

Unified Memory and Host-Device Interactions

Although the core of this chapter focuses on the GPU internal hierarchy, the interaction between GPU memory and host memory is also important. Data often begins in CPU memory, moves to GPU global memory across an interconnect, is processed, then returns.

Unified memory mechanisms abstract some of this data movement and give the illusion of a single address space. Under the hood, the system moves data between host memory and GPU memory, and may migrate pages on demand. These migrations involve the same global memory and cache structures inside the GPU.
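In CUDA, managed memory is one concrete form of this; a minimal sketch (error checking omitted):

#include <cuda_runtime.h>

__global__ void increment(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));    // one pointer, valid on host and device
    for (int i = 0; i < n; ++i) a[i] = 0.0f;     // first touched on the host

    increment<<<(n + 255) / 256, 256>>>(a, n);   // pages migrate to the GPU on demand
    cudaDeviceSynchronize();                     // wait before the host reads the results

    float first = a[0];                          // pages migrate back as the host accesses them
    (void)first;
    cudaFree(a);
    return 0;
}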

From the point of view of the GPU, host memory is typically much farther away and has even higher effective latency. Efficient GPU codes move data in large chunks, reuse data while it is resident in device memory, and minimize back and forth transfers.

Summary of Trade-Offs in the GPU Memory Hierarchy

The GPU memory hierarchy presents a layered set of trade-offs. Registers and shared memory are fast but small and must be managed carefully. Global memory is large but slow, and performance depends strongly on access patterns and cache behavior. Constant and read-only memories add specialized caches for specific patterns, such as broadcast or spatial locality. L1 and L2 caches help automatically, but cannot compensate for completely irregular or uncoalesced access patterns.

Effective use of this hierarchy means placing data where it best fits in terms of size, access frequency, and sharing pattern, and structuring computations to maximize locality and regularity. Later chapters that discuss CUDA, OpenACC, and performance considerations for accelerators will show how to apply these ideas in concrete kernel designs and optimization strategies.
