
GPU architecture basics

Overview of GPU Architecture

In HPC, GPUs are used as massively parallel numeric accelerators. Architecturally, they are very different from CPUs: where a CPU is optimized for low-latency execution of a few threads, a GPU is optimized for high-throughput execution of thousands of threads.

At a high level, a GPU consists of many simple arithmetic cores grouped into streaming multiprocessors, hardware schedulers that keep thousands of threads in flight, a memory hierarchy built around high-bandwidth device memory, and an interconnect that links the device to its CPU host.

This chapter focuses on the structural elements common to most modern discrete GPUs (NVIDIA, AMD, Intel) as they relate to HPC use.

Core Components: SMs and Cores

Streaming Multiprocessors / Compute Units

The basic building block of a GPU is the streaming multiprocessor (SM in NVIDIA terminology, compute unit or CU for AMD, Xe-core for Intel GPUs).

Each SM contains a set of scalar arithmetic units (the "cores"), one or more warp schedulers, a large register file, a block of on-chip shared memory, and load/store and special function units.

Key properties for HPC are the number of SMs on the device, the floating-point throughput per SM, the sizes of the register file and shared memory, and how many threads an SM can keep resident at once.

GPU “Cores” vs CPU Cores

GPU cores are simple, in-order arithmetic lanes that share instruction fetch, decode, and scheduling logic with the other lanes in their SM; each one is slow on its own, but there are thousands of them.

CPU cores are large, complex, out-of-order cores with deep cache hierarchies, branch prediction, and wide superscalar pipelines, optimized to make a single thread run as fast as possible.

This difference underpins why GPUs excel at data-parallel workloads and struggle with heavily branched, irregular code.
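A minimal sketch of this data-parallel style (assuming CUDA; the kernel name, array size, and launch configuration are illustrative, not taken from the original text): each thread handles exactly one array element, with no loops or inter-thread coordination.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread handles one element: the kind of regular, branch-free work
// that maps well onto many simple GPU cores.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory keeps the example short
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover all elements
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```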

SIMT Execution: Warps and Wavefronts

GPUs use a programming and execution model often described as SIMT (Single Instruction, Multiple Threads).

The scheduler on each SM groups threads into fixed-size bundles (warps of 32 threads in NVIDIA terminology, wavefronts of 32 or 64 threads on AMD), issues one instruction at a time to all threads in a bundle, and switches between resident warps to keep the execution units busy.
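The warp granularity is visible to the programmer. A small sketch (assuming CUDA and a warp size of 32; the kernel is illustrative): threads in the same warp can exchange register values directly with warp shuffle intrinsics, without going through shared or global memory, precisely because they execute in lockstep.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Sum one value per thread across a single warp using register-to-register
// shuffles. All 32 lanes execute the same instruction stream (SIMT).
__global__ void warp_sum(const float *in, float *out) {
    float v = in[threadIdx.x];
    // Butterfly reduction within the warp: halve the stride each step.
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffff, v, offset);
    }
    if (threadIdx.x == 0) {
        *out = v;   // lane 0 now holds the warp-wide sum
    }
}

int main() {
    float h_in[32];
    float h_out = 0.0f;
    float *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;

    cudaMalloc(&d_in, 32 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);

    warp_sum<<<1, 32>>>(d_in, d_out);           // exactly one warp
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);           // expect 32.0

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```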

Branching and Divergence

When threads in a warp follow different control paths (e.g., different if branches), the warp must execute each taken path in turn, masking off the threads that did not take it, and only reconverges once all paths have completed.

This is called control-flow divergence and reduces effective parallelism. In HPC kernels, you typically try to keep branch conditions uniform within a warp, move data-dependent decisions out of inner loops, and reorder or pad data so that neighbouring threads follow the same path.
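A minimal sketch of the contrast (assuming CUDA; both kernels and the launch configuration are illustrative): in the first kernel the branch depends on the thread index, so every warp executes both sides of the if; in the second the condition is uniform across each warp, so no serialization occurs.

```cuda
#include <cuda_runtime.h>

// Divergent: even and odd lanes of each warp take different branches,
// so the warp runs both paths one after the other with half the lanes masked.
__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) x[i] *= 2.0f;
    else            x[i] += 1.0f;
}

// Uniform: the condition depends only on the block index, so all threads
// in a warp take the same branch and no lanes are masked off.
__global__ void uniform(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0) x[i] *= 2.0f;
    else                     x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    divergent<<<n / 256, 256>>>(x);
    uniform<<<n / 256, 256>>>(x);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```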

Thread Hierarchy and Organization

Although each programming model has its own terminology, the hardware idea is similar: individual threads are grouped into blocks (work-groups), and blocks are grouped into a grid that covers the whole problem.

From the architecture perspective, a block is assigned to a single SM and stays there until it finishes; threads within a block can use the SM's shared memory and synchronize at barriers, while different blocks may run on different SMs, in any order, with no guaranteed ordering between them.

Inter-block communication is not handled directly by the hardware; it typically requires global memory or separate kernel launches.
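A short sketch of how the hierarchy appears in code (assuming CUDA; names and sizes are illustrative): the built-in indices identify a thread's position in its block and the block's position in the grid, and a grid-stride loop lets a fixed-size grid cover an arbitrarily large array.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Grid-stride loop: the grid does not need to match the problem size;
// each thread walks through the array in steps of the total thread count.
__global__ void scale(int n, float a, float *x) {
    int stride = gridDim.x * blockDim.x;            // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        x[i] *= a;
    }
}

int main() {
    const int n = 1 << 22;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    // 128 blocks of 256 threads; blocks are scheduled onto SMs independently.
    scale<<<128, 256>>>(n, 3.0f, x);
    cudaDeviceSynchronize();
    printf("x[n-1] = %f\n", x[n - 1]);              // expect 3.0
    cudaFree(x);
    return 0;
}
```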

GPU Memory Hierarchy

Like CPUs, GPUs have a hierarchy of memories with different sizes, latencies, and bandwidths. For HPC, understanding their roles is essential for performance.

On-Device Global Memory (VRAM / HBM)

Global memory is where the bulk of a kernel's data lives: input and output arrays, intermediate fields, and anything transferred from the host. It is large (tens of gigabytes on current devices) and offers very high bandwidth, but individual accesses still cost hundreds of cycles of latency.

Caches

Modern GPUs have multiple cache levels: a small L1 cache per SM (often unified with shared memory), a larger L2 cache shared by all SMs, and, on some architectures, additional read-only and instruction caches.

Not all memory types are cached the same way, and cache behavior can be architecture- and access-pattern-dependent. In HPC, coalesced and regular memory accesses are used to maximize cache and memory-system efficiency.
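A minimal sketch of what "coalesced" means in practice (assuming CUDA; the kernels are illustrative): when consecutive threads of a warp touch consecutive addresses, their loads combine into a few wide transactions; when they touch addresses far apart, each load needs its own transaction and effective bandwidth drops.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so a warp's 32 loads fall into
// adjacent addresses and merge into a small number of memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read addresses `stride` elements apart,
// so the warp's loads scatter across many transactions.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    copy_coalesced<<<n / 256, 256>>>(in, out, n);
    copy_strided<<<n / 256, 256>>>(in, out, n, 32);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```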

Shared Memory / Local Data Store

Each SM has a small, explicitly managed on-chip memory (shared memory on NVIDIA, local data share on AMD) that is visible to all threads of a block, offers latency and bandwidth far better than global memory, and must be allocated and indexed by the programmer.

Typical HPC uses include staging tiles of matrices for blocked matrix multiplication, holding stencil halos, and performing block-level reductions and scans, as in the sketch below.
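The following is a minimal block-level reduction (assuming CUDA; the block size of 256 and the kernel name are illustrative): each block stages its elements in shared memory and cooperatively sums them, writing one partial result back to global memory.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Block-level reduction: shared memory lets the threads of one block
// cooperate without round trips to global memory.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // all loads done before anyone reads

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];    // one partial sum per block
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    double total = 0.0;
    for (int b = 0; b < blocks; ++b) total += out[b];
    printf("sum = %.0f\n", total);              // expect 1048576
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```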

Registers

Each SM has a large register file that is partitioned among its resident threads; registers are the fastest storage on the GPU. If a kernel uses too many registers, the compiler may spill values to "local" memory (which physically lives in global memory), or fewer threads can be resident on the SM, reducing occupancy.
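One way to influence this trade-off in CUDA is the __launch_bounds__ qualifier, sketched below with illustrative values: it tells the compiler the maximum block size (and optionally a minimum number of resident blocks per SM) so it can budget registers accordingly.

```cuda
#include <cuda_runtime.h>

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// the compiler caps register use so that 4 blocks of 256 threads can be
// resident per SM, instead of letting register pressure limit occupancy.
__global__ void __launch_bounds__(256, 4) heavy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = x[i];
        float b = a * a;
        float c = b * a;
        float d = c * b;          // a handful of live values held in registers
        x[i] = a + b + c + d;
    }
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    heavy_kernel<<<n / 256, 256>>>(x, n);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```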

Constant and Texture Memories (Vendor-Specific Names)

Most GPUs provide specialized read-only memories: a small constant memory that is cached and broadcast efficiently when all threads of a warp read the same address, and texture/read-only data paths with their own caches optimized for spatially local access patterns.

Using these appropriately can reduce pressure on global memory.
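A small sketch of constant memory in CUDA (the coefficient array and kernel are illustrative): a few polynomial coefficients read by every thread are placed in __constant__ memory, so the warp-wide reads are served by the constant cache rather than global memory.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Coefficients in constant memory: read-only for the kernel and broadcast
// efficiently when all threads in a warp read the same element.
__constant__ float coeff[4];

__global__ void poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = coeff[0] + v * (coeff[1] + v * (coeff[2] + v * coeff[3]));
    }
}

int main() {
    const int n = 1 << 20;
    float h_coeff[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // host -> constant memory

    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    poly<<<n / 256, 256>>>(x, y, n);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);                            // expect 10.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```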

Host–Device Relationship and Interconnect

Discrete GPUs are typically attached to a CPU host via an interconnect such as PCIe or a vendor-specific link like NVLink (NVIDIA) or Infinity Fabric (AMD).

Architecturally, host and device have separate physical memories: data the GPU is to process must be moved (or mapped) across the interconnect, and the interconnect's bandwidth is far lower than the GPU's own memory bandwidth.

For HPC, this means host–device transfers are expensive relative to device-side computation; performant codes keep data resident on the GPU across many kernel launches, overlap transfers with computation, and avoid ping-ponging data between host and device.

Some modern systems support features like unified (managed) memory, cache-coherent CPU–GPU links, and direct GPU-to-GPU or GPU-to-network transfers.

But physically, bandwidth and latency constraints still matter for performance.
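A minimal sketch of the explicit host–device workflow (assuming CUDA; sizes and names are illustrative): allocate device memory, copy input across the interconnect, launch a kernel, and copy the result back.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void increment(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n, 0.0f);   // host (CPU) memory
    float *d;                        // device (GPU) global memory
    cudaMalloc(&d, n * sizeof(float));

    // Every byte crosses the host-device interconnect twice in this toy
    // example; real HPC codes keep data resident on the device instead.
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    increment<<<n / 256, 256>>>(d, n);
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[0] = %f\n", h[0]);     // expect 1.0
    cudaFree(d);
    return 0;
}
```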

Latency Hiding and Occupancy

GPUs tolerate high latency (especially memory latency) by running many threads concurrently: when one warp stalls on a memory access, the SM's scheduler simply issues instructions from another resident warp, so the execution units rarely sit idle.

Occupancy is the ratio of warps resident on an SM to the maximum the hardware supports; it is limited by each thread's register use, each block's shared-memory use, and the block size, and enough occupancy is needed to give the scheduler warps to switch between.

Architecturally, GPU design assumes that there is always far more parallel work available than execution units, so latency can be hidden by switching threads rather than by large caches and speculative execution.
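The CUDA runtime exposes an occupancy calculator; the sketch below (the kernel and block size are illustrative) asks how many blocks of a given kernel can be resident per SM and converts that into a warp-occupancy percentage.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;
    int numBlocks = 0;
    // How many blocks of this kernel can be resident on one SM at once,
    // given its register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, dummy, blockSize, 0);

    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %d of %d warps per SM (%.0f%%)\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}
```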

Specialized Functional Units

Besides general-purpose ALUs, GPUs often include specialized hardware units: special function units for transcendentals (exp, sin, reciprocal square root), matrix or tensor cores for small dense matrix multiply–accumulate operations, and dedicated load/store and texture units.

At the architectural level, these units sit inside the SM alongside the ordinary ALUs, are fed by the same schedulers, and deliver much higher throughput than the general-purpose pipelines for their specific operations, often at reduced or mixed precision.
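As a small example of targeting the special function units in CUDA (the kernel is illustrative): the __expf intrinsic uses the fast hardware approximation rather than the fully accurate exp, trading a few bits of precision for throughput.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// __expf maps to the special function units: faster than expf(),
// at reduced precision.
__global__ void sigmoid(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f / (1.0f + __expf(-in[i]));
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 0.0f;

    sigmoid<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);   // expect ~0.5
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```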

Multi-GPU and Node-Level GPU Topologies

Within a single node, there may be multiple GPUs connected via PCIe switches, direct GPU-to-GPU links such as NVLink or Infinity Fabric, or a combination of both.

Architecturally, this affects how quickly GPUs can exchange data with each other and with the host, whether peer-to-peer access between GPUs is possible, and which CPU sockets and network adapters are "close" to which GPUs.

The physical layout (topology) matters for multi-GPU solvers that exchange halos, for collective communication across a node, and for deciding which GPU a given process or MPI rank should use; it can be queried at runtime, as in the sketch below.
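A minimal topology query (assuming CUDA; output formatting is illustrative): enumerate the GPUs in the node and check which pairs support peer-to-peer access, i.e., direct GPU-to-GPU transfers that do not stage through host memory.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("GPUs in this node: %d\n", count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, %zu MiB global memory\n",
               i, prop.name, prop.totalGlobalMem >> 20);

        // Check which other GPUs this one can reach directly.
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("  peer access %d -> %d: %s\n", i, j, canAccess ? "yes" : "no");
        }
    }
    return 0;
}
```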

Precision and Throughput Trade-offs

GPU hardware is designed with different arithmetic pipelines for double precision (FP64), single precision (FP32), and reduced-precision formats such as FP16 and BF16, alongside integer units.

Architecturally, the balance between these pipelines varies widely: HPC-oriented parts often provide FP64 at half the FP32 rate, consumer parts may offer only a small fraction of it, and reduced-precision matrix units can deliver many times the FP32 rate.

This leads to common patterns in HPC: using FP64 where accuracy demands it, FP32 where it does not, and mixed-precision schemes such as iterative refinement or FP16/BF16 storage with FP32 accumulation to exploit the fastest pipelines safely, as sketched below.

Understanding the GPU’s precision mix is important when selecting hardware for specific scientific workloads.
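A minimal mixed-precision sketch (assuming CUDA and cuda_fp16.h; the kernel and sizes are illustrative): the inputs are stored in FP16 to halve memory traffic, while the dot product is accumulated in FP32 to limit rounding error.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstdio>

// Fill the FP16 inputs on the device to keep the example simple.
__global__ void fill(__half *a, __half *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { a[i] = __float2half(1.0f); b[i] = __float2half(1.0f); }
}

// FP16 storage, FP32 accumulation: each thread accumulates a grid-stride
// partial sum in single precision, then adds it to the global result.
__global__ void dot_mixed(const __half *a, const __half *b, float *result, int n) {
    float partial = 0.0f;
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        partial += __half2float(a[i]) * __half2float(b[i]);
    }
    atomicAdd(result, partial);
}

int main() {
    const int n = 1 << 20;
    __half *a, *b;
    float *result;
    cudaMalloc(&a, n * sizeof(__half));
    cudaMalloc(&b, n * sizeof(__half));
    cudaMallocManaged(&result, sizeof(float));
    *result = 0.0f;

    fill<<<n / 256, 256>>>(a, b, n);
    dot_mixed<<<64, 256>>>(a, b, result, n);
    cudaDeviceSynchronize();
    printf("dot = %.0f\n", *result);   // expect 1048576
    cudaFree(a);
    cudaFree(b);
    cudaFree(result);
    return 0;
}
```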

Summary

From an architectural perspective, GPUs combine many simple cores grouped into SMs, SIMT execution over warps, a deep and partly explicitly managed memory hierarchy, and latency hiding through massive multithreading.

These features make GPUs highly effective for data-parallel, compute-intensive workloads that can exploit massive concurrency and structured memory access patterns, which is central to their role in HPC.
