Memory hierarchy on GPUs

Overview: Why GPU Memory Hierarchy Matters

GPU performance depends heavily on how you move data, not just how many cores you use. On modern GPUs, the difference between using the “right” memory and the “wrong” memory can easily be a 10–100× performance factor.

This chapter focuses on the levels of the GPU memory hierarchy, how they differ in size and speed, and how to use each level effectively from your kernels.

We’ll describe things using CUDA-like terminology, but the concepts apply similarly to other GPU models (OpenCL, HIP, etc.).

Big Picture: GPU Memory Stack

Conceptually, you can think of GPU memory as layers, from fastest/smallest/closest to the compute units, to slowest/largest/farthest:

  1. Registers (per thread)
  2. Local memory (per thread, but in device DRAM)
  3. Shared memory / L1 (per thread block / workgroup)
  4. L2 cache (shared by all SMs / compute units)
  5. Global memory (device DRAM)
  6. Constant & texture memory (global but cached/specialized)
  7. Host memory & storage (CPU RAM, disks – accessed over PCIe or NVLink)

We’ll now go through the important ones in order of “closeness” to computation.

Registers

Registers are the smallest and fastest storage on the GPU.

Registers hold:

  • per-thread scalar variables such as indices and loop counters,
  • intermediate values of arithmetic expressions,
  • compiler-generated temporaries.

For performance, you want:

  • frequently used per-thread values to stay in registers,
  • register usage per thread low enough that occupancy is not limited,
  • no spilling to local memory.

When a kernel uses more registers than the hardware allows per thread, the compiler “spills” some values to local memory, which is much slower.
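
As a rough illustration (hypothetical kernel, not from this chapter), the scalars below normally live in registers; the optional __launch_bounds__ qualifier is one CUDA mechanism for telling the compiler the maximum block size so it can budget registers per thread accordingly:

    // Hypothetical kernel: i, x, and acc are per-thread scalars that the
    // compiler normally keeps in registers.
    __global__ void __launch_bounds__(256) scale_add(const float* a,
                                                     const float* b,
                                                     float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // index held in a register
        if (i < n) {
            float x   = a[i];                 // per-thread scalar: register
            float acc = 2.0f * x + b[i];      // intermediate value: register
            out[i] = acc;
        }
    }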

Local Memory (Per-Thread but in Global DRAM)

Despite the name, local memory in CUDA (and similar in other APIs) is not on-chip; it lives in device global DRAM but is logically private to a thread.

Local memory is used for:

  • register spills (when a kernel needs more registers than are available),
  • per-thread arrays indexed with values not known at compile time,
  • large per-thread structures that do not fit in registers.

You normally don’t intend to use local memory; it appears when:

  • the compiler runs out of registers and spills values,
  • you declare per-thread arrays and index them dynamically.

In performance tuning, one of the goals is to reduce local memory usage by:

  • lowering register pressure (simpler kernels, splitting large kernels),
  • avoiding dynamically indexed per-thread arrays,
  • moving data that must be indexed dynamically into shared memory instead.
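
A minimal sketch of how local memory typically appears (hypothetical kernel and names): the per-thread array buf is logically private, but its dynamic indexing usually forces it into local memory. Compiling with nvcc -Xptxas -v reports per-kernel register, spill, and local-memory usage, which helps spot this.

    __global__ void local_memory_example(const int* idx, const float* in,
                                         float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float buf[16];                  // logically private to this thread
        for (int k = 0; k < 16; ++k)
            buf[k] = in[i] * k;         // compile-time indexing: may stay in registers

        out[i] = buf[idx[i] & 15];      // runtime indexing: typically forces local memory
    }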

Shared Memory (On-Chip, Per Block)

Shared memory is explicitly managed on-chip memory accessible by all threads in a block (CUDA) or workgroup (OpenCL).

Key properties:

  • on-chip, with much lower latency and higher bandwidth than global memory,
  • visible to all threads of a block, enabling cooperation,
  • limited in size (tens of kilobytes per block),
  • valid only for the lifetime of the block.

Typical uses:

  • tiling/blocking: staging a chunk of global memory that many threads reuse,
  • intra-block communication, reductions, and scans,
  • reordering data to turn awkward global access patterns into friendly ones.
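
A small sketch of the staging pattern (hypothetical kernel; assumes the block size equals BLOCK_SIZE): each thread loads one element into shared memory, the block synchronizes, and then neighboring values are reused from on-chip storage instead of being re-read from global memory.

    #define BLOCK_SIZE 256

    // 3-point moving average over a 1D array.
    __global__ void smooth3(const float* in, float* out, int n)
    {
        __shared__ float tile[BLOCK_SIZE + 2];        // +2 for halo elements

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;                    // shift by 1 for the left halo

        tile[lid] = (gid < n) ? in[gid] : 0.0f;       // each thread stages one element
        if (threadIdx.x == 0)
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;                  // left halo
        if (threadIdx.x == blockDim.x - 1)
            tile[BLOCK_SIZE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f; // right halo
        __syncthreads();                              // tile now visible to the whole block

        if (gid < n)
            out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }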

Bank Conflicts

Shared memory is divided into banks that can be accessed in parallel.

You don’t need the full hardware details here, but conceptually:

  • if the threads of a warp access different banks, the accesses proceed in parallel,
  • if several threads access different addresses in the same bank, the accesses are serialized (a bank conflict),
  • a common fix is to pad arrays so that strided accesses map to different banks.

Shared memory layout and access patterns are a major part of GPU performance optimization.
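
The classic illustration is padding a shared-memory tile so that column accesses do not all hit the same bank. The sketch below (hypothetical kernel; assumes 32×32 thread blocks and a square matrix whose width is a multiple of 32) transposes a tile; the extra padding column is the whole trick.

    __global__ void transpose_tile(const float* in, float* out, int width)
    {
        __shared__ float tile[32][33];   // 33, not 32: the padding column avoids bank conflicts

        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise write: conflict-free
        __syncthreads();

        int tx = blockIdx.y * 32 + threadIdx.x;                // transposed block coordinates
        int ty = blockIdx.x * 32 + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column-wise read: conflict-free thanks to padding
    }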

L1 and L2 Caches

Modern GPUs have hardware-managed caches similar in spirit to CPU caches.

Key roles:

  • L1 (per SM) keeps recently used data close to the compute units,
  • L2 (shared by all SMs) caches global memory traffic and absorbs some irregularity in access patterns.

From a programmer’s perspective:

  • these caches are hardware-managed; you do not address them explicitly,
  • you influence them indirectly through data layout and access patterns.

Some APIs offer hints (e.g., specifying read-only data to enable better caching), but the core behavior is automatic.
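
As one hedged example of such a hint (hypothetical kernel): marking input pointers const and __restrict__ tells the compiler the data is read-only and non-aliasing, which on many architectures lets these loads be served through a read-only cache path automatically.

    __global__ void axpy(const float* __restrict__ x,   // read-only, non-aliasing inputs
                         const float* __restrict__ y,
                         float* __restrict__ out,
                         float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a * x[i] + y[i];    // loads of x and y may go through the read-only cache
    }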

Global Memory (Device DRAM)

Global memory is the main memory on the GPU device.

Global memory is where:

  • buffers allocated from the host (e.g., with cudaMalloc) live,
  • kernel inputs and outputs are stored,
  • most of your working data resides while it is on the GPU.

Performance considerations focus on memory coalescing and access patterns.

Memory Coalescing

Threads are grouped into warps (or wavefronts). When threads in a warp access global memory:

  • the hardware tries to combine (coalesce) their accesses into as few wide transactions as possible,
  • neighboring threads reading neighboring addresses can be served by a single transaction,
  • scattered or widely strided accesses require many transactions and waste bandwidth.

Coalescing rules differ by architecture, but a few general guidelines:

  • have consecutive threads access consecutive addresses,
  • keep data aligned and prefer structure-of-arrays layouts over array-of-structures,
  • avoid large strides and random access in hot loops.

Global memory is slow in latency terms, so GPU hardware hides latency by running many threads concurrently. Coalesced, predictable access maximizes the use of available bandwidth.
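
The contrast is easiest to see in two hypothetical copy kernels: in the first, consecutive threads touch consecutive addresses, so a warp’s loads coalesce into a few wide transactions; in the second, each warp touches many separate memory segments.

    // Coalesced: thread i reads element i.
    __global__ void copy_coalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads are `stride` elements apart, wasting bandwidth.
    __global__ void copy_strided(const float* in, float* out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }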

Constant Memory

Constant memory is a small, read-only memory region optimized for broadcasting the same value to many threads.

Performance characteristics:

  • backed by a small dedicated constant cache on chip,
  • very fast when all threads of a warp read the same address (broadcast),
  • serialized, and therefore slower, when threads of a warp read different addresses.

Typical uses:

  • kernel parameters and configuration values,
  • small coefficient or lookup tables read uniformly by all threads.

Your code usually declares __constant__ arrays/variables (CUDA) and initializes them from host code.
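
A minimal sketch (hypothetical names c_coeffs and polynomial): every thread reads the same constant-memory entry in each loop iteration, which the constant cache broadcasts efficiently; the host fills the array with cudaMemcpyToSymbol before launching the kernel.

    __constant__ float c_coeffs[16];     // small, read-only, visible to all kernels

    __global__ void polynomial(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float x = in[i], acc = 0.0f;
        for (int k = 0; k < 16; ++k)
            acc = acc * x + c_coeffs[k]; // same address for every thread: broadcast from the constant cache
        out[i] = acc;
    }

    void upload_coeffs(const float h_coeffs[16])
    {
        // Host side: copy into the __constant__ symbol before the launch.
        cudaMemcpyToSymbol(c_coeffs, h_coeffs, 16 * sizeof(float));
    }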

Texture Memory and Read-Only Caches

Texture memory (or more generally, read-only data caches) is another specialized path to device memory.

Typical uses:

  • image-like data with 2D/3D spatial locality,
  • read-only tables accessed through irregular but localized patterns,
  • cases that benefit from hardware interpolation or boundary handling.

Modern CUDA also has a generic read-only data cache. On supporting architectures, loads of data that won’t be modified by the kernel can be routed through it explicitly, e.g., with __ldg(), taking a path optimized for read-only caching.
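
For example (hypothetical kernel), a gather through a read-only table can be expressed with __ldg() so that the loads are served by the read-only cache; the table must not be written by the same kernel.

    __global__ void gather(const float* __restrict__ table,
                           const int* __restrict__ idx,
                           float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __ldg(&table[idx[i]]);  // explicit load through the read-only data cache
    }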

The main decisions:

  • is the data truly read-only for the duration of the kernel?
  • does its access pattern benefit from the read-only/texture path rather than ordinary cached global loads?

Host Memory and Interconnect (PCIe / NVLink)

From the GPU’s perspective, host RAM is another, much slower level in the hierarchy, reached over an interconnect:

  • PCIe on most systems,
  • NVLink (or a similar high-bandwidth link) on some platforms.

Data movement between host memory and device global memory is expensive relative to on-GPU accesses.

Key implications:

  • minimize host–device transfers; they are often the dominant cost for short kernels,
  • keep data resident on the GPU across multiple kernels,
  • overlap unavoidable transfers with computation (streams, asynchronous copies).

Host memory is typically managed by CPU code; in this chapter, the key takeaway is that from a GPU kernel’s point of view, it is much further away than global memory and usually not directly visible.
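
A host-side sketch of the “copy once, compute many times, copy back once” pattern (hypothetical kernel step; error handling omitted):

    __global__ void step(float* data, int n);   // hypothetical kernel, defined elsewhere

    void run_pipeline(const float* h_in, float* h_out, int n)
    {
        float* d_buf = nullptr;
        cudaMalloc(&d_buf, n * sizeof(float));
        cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);   // host -> device, over PCIe/NVLink

        int threads = 256, blocks = (n + threads - 1) / threads;
        for (int it = 0; it < 100; ++it)
            step<<<blocks, threads>>>(d_buf, n);  // intermediate results stay in device DRAM between launches

        cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host, once at the end
        cudaFree(d_buf);
    }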

Putting It Together: Design Patterns for the Hierarchy

When you write GPU kernels, an effective mental model is the following (the tiled matrix-multiply sketch after this list shows the steps working together):

  1. Put what you can in registers (per-thread variables).
  2. Use shared memory to cache data reused by threads within a block:
    • Load tiles/chunks from global → shared.
    • Do computations that reuse these tiles.
    • Write results back to global only when needed.
  3. Access global memory coalesced:
    • Align data layout with thread layout.
    • Prefer contiguous, predictable access patterns.
  4. Use constant / texture / read-only caches for:
    • Small read-only parameters.
    • Fields or tables with spatial locality or irregular patterns.
  5. Minimize host–device data movement:
    • Batch kernels and keep data resident on the GPU.
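
The sketch below pulls these steps together in the classic tiled matrix multiply (hypothetical kernel; assumes square n×n matrices with n a multiple of TILE and TILE×TILE thread blocks): the accumulator lives in a register, tiles of A and B are cached in shared memory and reused TILE times, global loads are coalesced, and each thread writes its result back to global memory once.

    #define TILE 16

    __global__ void matmul_tiled(const float* A, const float* B, float* C, int n)
    {
        __shared__ float As[TILE][TILE];    // tiles cached on-chip per block
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;                   // accumulator stays in a register

        for (int t = 0; t < n / TILE; ++t) {
            // Coalesced loads: consecutive threadIdx.x reads consecutive addresses.
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                // tiles are ready for the whole block

            for (int k = 0; k < TILE; ++k)  // each staged element is reused TILE times
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                // don't overwrite tiles still in use
        }

        C[row * n + col] = acc;             // single write back to global memory
    }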

In practice, high-performance GPU codes:

  • combine these patterns (register accumulators, shared-memory tiles, coalesced global access),
  • keep data resident on the GPU and minimize host–device traffic,
  • are profiled to find out which memory level is the current bottleneck.

Understanding this hierarchy is essential before diving into programming models like CUDA or OpenACC, where you’ll see explicit mechanisms for controlling how and where data lives on the GPU.
