What Makes GPUs and Accelerators Different from CPUs
GPUs and other accelerators are processors designed to run many operations in parallel, trading some flexibility for much higher throughput than general-purpose CPUs.
At a high level:
- CPUs are optimized for low-latency execution of a small number of threads with complex control flow.
- GPUs/accelerators are optimized for high-throughput execution of massively parallel workloads, often with simpler control flow but huge amounts of data.
Key differences relevant to HPC:
- Core count
- CPUs: a few to a few dozen powerful cores.
- GPUs: thousands of simpler cores (organized into streaming multiprocessors or similar units).
- Execution style
- CPUs: focus on single-thread performance and sophisticated branch prediction.
- GPUs: focus on running many threads in lockstep (SIMT/SIMD-style execution), best for regular, data-parallel work.
- Memory system
- CPUs: large, low-latency caches; comparatively lower memory bandwidth.
- GPUs: very high memory bandwidth, but higher latency; optimized for streaming large datasets.
Understanding these trade-offs is crucial when deciding whether and how to offload computation to an accelerator in an HPC context.
Types of Accelerators in HPC
Although “GPU” is often used as shorthand, modern HPC systems use several kinds of accelerators:
- GPUs (Graphics Processing Units)
- General-purpose GPUs (GPGPUs) used for compute, not just graphics.
- Vendors: NVIDIA, AMD, Intel.
- Typical programming models: CUDA (NVIDIA), HIP (AMD), SYCL (Intel/portable), OpenCL, OpenACC, OpenMP offload.
- Many-core accelerators
- Chips with a large number of relatively simple cores, sometimes with wide vector units.
- Historically: Intel Xeon Phi (discontinued but important historically).
- Conceptually similar to GPUs in that they rely on massive parallelism.
- Specialized accelerators
- AI/ML accelerators (e.g., Google TPUs, various NPUs) optimized for matrix/tensor operations, often with reduced precision.
- FPGA-based accelerators (Field Programmable Gate Arrays) configured to implement custom pipelines for specific algorithms.
- In HPC, these are used where a particular algorithm can be highly optimized in hardware.
- On-die or integrated accelerators
- GPUs or vector units integrated on the CPU package (e.g., Intel integrated GPU, AMD APU, specialized matrix units).
- Shorter distances for data movement, but often more limited peak performance compared to discrete accelerators.
When people talk about “GPU nodes” in a cluster, they typically mean compute nodes with one or more discrete GPUs attached via PCIe or a high-speed interconnect like NVLink.
Why GPUs and Accelerators Matter for HPC
Accelerators are widely used in HPC because many scientific and engineering workloads have:
- High arithmetic intensity: many floating-point operations per byte of data.
- Regular data-parallel structure: the same operation applied independently to many data elements.
- Relaxed control complexity: relatively simple branching and control flow.
For such workloads, accelerators offer:
- Higher peak floating-point performance (FLOPS) per node.
- Higher memory bandwidth to feed data-hungry computations.
- Better performance per watt, which is critical at large scale.
Typical HPC use cases:
- Large dense and sparse linear algebra kernels.
- Particle-based simulations (molecular dynamics, N-body, particle-in-cell).
- Grid/mesh-based PDE solvers (finite difference, finite volume, finite element).
- Signal and image processing, FFT-heavy workloads.
- AI/ML components integrated into simulation workflows (“AI for science”).
Basic GPU Architecture Concepts (HPC Perspective)
While a later chapter will dive deeper into GPU architecture, here are a few concepts specifically relevant to understanding accelerators’ role in HPC:
Massive Parallelism and SIMT
Most GPUs follow a SIMT (Single Instruction, Multiple Threads) or related model:
- Threads are grouped into warps/wavefronts (e.g., 32 threads on many NVIDIA GPUs).
- All threads in a warp execute the same instruction at the same time on different data.
- Ideal workloads: for loops over large arrays where each iteration is independent.
Implication for HPC applications:
- Algorithms with many independent operations on large datasets map well to GPUs.
- Algorithms with irregular control flow (heavy branching, recursion, complex dependencies) are often harder to accelerate efficiently.
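As a minimal sketch of how an independent loop maps onto the SIMT model, the following CUDA fragment assigns one array element to each GPU thread. The kernel name, block size, and launch configuration are illustrative, not taken from any particular application:

```cuda
#include <cuda_runtime.h>

// One GPU thread per loop iteration: the data-parallel equivalent of
//   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
}

// Launch enough 256-thread blocks to cover all n elements (d_x, d_y are device pointers):
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```

Each warp of 32 threads executes the same multiply-add instruction on 32 consecutive elements, which is exactly the regular, data-parallel pattern GPUs are built for.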
Memory Hierarchy on Accelerators (Overview)
Accelerators have a memory hierarchy distinct from the CPU’s:
- Device/global memory: large, high-bandwidth memory (e.g., HBM or GDDR); higher latency than registers and on-chip memories, but with much higher bandwidth than typical CPU DRAM.
- On-chip memories (names vary by vendor):
- Shared memory / scratchpad / local memory close to execution units.
- Registers for each thread.
- Small caches, which may be less central than on CPUs depending on architecture.
From an HPC programmer’s point of view:
- Performance is heavily influenced by how data is laid out and accessed.
- Coalesced, sequential access patterns are crucial for achieving high bandwidth.
- Data movement between CPU and accelerator memories can become the bottleneck.
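The importance of coalesced access can be illustrated with two small CUDA kernels. These are illustrative only; the exact penalty for scattered access depends on the architecture:

```cuda
#include <cuda_runtime.h>

// Coalesced: consecutive threads of a warp touch consecutive addresses, so the
// hardware can merge their loads/stores into a few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses far apart, so each warp issues
// many separate memory transactions and much of the available bandwidth is wasted.
__global__ void copy_strided(const float *in, float *out, long long n, int stride) {
    long long j = (long long)(blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (j < n) out[j] = in[j];
}
```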
CPU–Accelerator Interaction and Data Movement
In a typical HPC node with accelerators:
- The CPU (“host”):
- Manages the OS, filesystem I/O, job scheduling, MPI processes, etc.
- Launches work on the accelerators and manages data transfers.
- The accelerator (“device”):
- Performs compute-intensive kernels on data that has been moved to its memory.
Key concepts:
- Discrete vs integrated
- Discrete accelerators have their own physical memory; data must be transferred over PCIe or similar.
- Integrated accelerators may share memory or have coherent access to system RAM.
- Host–device transfers
- Data must often be copied explicitly between host and device memory.
- Transfers have relatively high latency and much lower bandwidth than the accelerator's own memory.
- Performance guidelines:
- Minimize data movement.
- Overlap computation and communication where possible.
- Reuse data on the device instead of repeatedly transferring it.
For HPC cluster design, interconnects like NVLink or Infinity Fabric can reduce bottlenecks between GPUs and between GPU and CPU compared to standard PCIe.
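A common way to follow the "overlap computation and communication" guideline is to split the data into chunks and use asynchronous copies on separate streams. The sketch below uses the CUDA runtime API; process_chunk is a hypothetical placeholder kernel and the two-stream chunking scheme is deliberately simplified:

```cuda
#include <cuda_runtime.h>

// Placeholder device kernel standing in for the real computation on one chunk.
__global__ void process_chunk(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Pipeline host->device copy, compute, and device->host copy across two streams,
// so the copies of one chunk overlap with the computation on another.
// h_data should be pinned (cudaMallocHost) for the copies to be truly asynchronous.
void run_pipelined(float *h_data, float *d_data, int n, int nchunks) {
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    int chunk = n / nchunks;  // assume n divides evenly, for simplicity
    for (int c = 0; c < nchunks; ++c) {
        cudaStream_t st = streams[c % 2];
        size_t off = (size_t)c * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process_chunk<<<(chunk + 255) / 256, 256, 0, st>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    for (int s = 0; s < 2; ++s) cudaStreamSynchronize(streams[s]);
    for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
}
```

Even better than overlapping transfers is avoiding them entirely by keeping data resident on the device across many kernel calls.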
Multiple GPUs and Accelerator Topologies in Nodes
Modern HPC nodes often contain multiple accelerators. How they are connected matters:
- GPU–CPU topology
- Each GPU may be attached to a particular CPU socket.
- NUMA effects: data allocated on a CPU memory bank closer to a particular GPU may be accessed more efficiently.
- GPU–GPU connectivity
- Some systems have direct GPU–GPU links (NVLink, NVSwitch, proprietary fabrics).
- Others rely on the PCIe fabric and potentially the CPU for routing.
- Faster GPU–GPU links benefit multi-GPU algorithms that exchange data frequently (e.g., domain decomposition methods, distributed neural network training).
For HPC applications:
- Mapping MPI ranks/threads to GPUs and understanding the topology can have a noticeable impact on performance and scalability.
- Hybrid programming models often use one or more MPI processes per GPU, each handling a portion of the global domain.
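A common pattern for the MPI-process-per-GPU model is to select a device based on each rank's position within its node. The following is a hedged sketch using an MPI-3 shared-memory communicator plus the CUDA runtime; error handling is omitted, and it assumes the scheduler places no more ranks per node than there are GPUs:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Give each MPI rank on a node its "own" GPU via a simple round-robin mapping.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // Determine this rank's position among the ranks sharing the same node.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    int ngpus;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(local_rank % ngpus);  // real codes may also consult the
                                        // CPU/GPU topology and NUMA layout

    // ... allocate device memory, run kernels, exchange halos via MPI ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```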
Programming Models for GPUs and Accelerators (Conceptual Overview)
Detailed programming interfaces appear elsewhere in the course, but at this stage it’s useful to know the main categories of models:
- Vendor-specific low-level APIs
- CUDA (NVIDIA), HIP (AMD), Level Zero (Intel).
- Fine-grained control, highest performance potential, less portable between vendors.
- Portable C++-style models
- SYCL, Kokkos, RAJA, and similar frameworks.
- Aim for performance portability across CPU and GPU architectures.
- Directive-based models
- OpenACC, OpenMP target/offload.
- Use pragmas/directives (#pragma annotations) in existing CPU code to offload regions to accelerators with minimal code changes.
- Library-based acceleration
- Many numerical libraries provide GPU-accelerated backends (BLAS, FFT, solver libraries, ML frameworks).
- Users can gain accelerator performance without writing low-level device code.
In HPC practice, it is common to:
- Use accelerated libraries where possible (e.g., GPU BLAS, FFT).
- Combine MPI between nodes and GPU programming or directives within a node.
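As an example of library-based acceleration, a dense matrix multiply can be handed to a vendor BLAS (here cuBLAS; the same idea applies to rocBLAS, oneMKL, and similar libraries). A minimal, hedged sketch with error checking omitted, assuming the matrices are already resident in device memory:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Computes C = alpha*A*B + beta*C for n x n column-major matrices
// whose device pointers (d_A, d_B, d_C) are already allocated and filled.
void gemm_on_gpu(const double *d_A, const double *d_B, double *d_C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,           // m, n, k
                &alpha, d_A, n,    // A and its leading dimension
                        d_B, n,    // B and its leading dimension
                &beta,  d_C, n);   // C and its leading dimension

    cublasDestroy(handle);
}
```

The application never writes a device kernel here; the library selects the tiling and architecture-specific code paths internally.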
When Are Accelerators a Good Fit?
Accelerators are not universally beneficial. They are most effective when:
- The problem is large enough to keep thousands of threads busy.
- The computation has high arithmetic intensity (lots of compute per byte moved).
- Data access patterns can be made regular and coalesced.
- Data can be reused on the device or the cost of host–device transfers is small relative to the computation.
They are less effective when:
- The dataset is very small (kernel launch and transfer overhead dominate).
- The algorithm is heavily branchy or has irregular data structures (e.g., complex graph algorithms).
- The application is constrained by operations that cannot be offloaded easily (e.g., certain legacy libraries, system-level tasks).
Understanding these factors helps decide which parts of an HPC application should be offloaded and how to design algorithms to exploit accelerators.
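As a rough worked example of arithmetic intensity (a back-of-the-envelope estimate that ignores caches and write-allocate traffic):

```
Vector update (daxpy): y[i] = a*x[i] + y[i], N double-precision elements
  Floating-point operations:  2N          (one multiply + one add per element)
  Bytes moved to/from memory: 24N         (read x, read y, write y; 8 bytes each)
  Arithmetic intensity:       2N / 24N ≈ 0.08 FLOP/byte   -> bandwidth-bound

Dense matrix multiply (dgemm): C = A*B, N x N double-precision matrices
  Floating-point operations:  ~2N^3
  Bytes (with ideal reuse):   ~24N^2      (read A and B, write C once)
  Arithmetic intensity:       ~N/12 FLOP/byte             -> compute-bound for large N
```

Kernels on the left side of this spectrum are limited by memory bandwidth on any processor, while kernels on the right can actually exploit an accelerator's floating-point peak.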
Performance and Resource Considerations
Accelerators affect not only raw speed but also how HPC resources are requested and used:
- Performance per watt
- Accelerators typically offer more FLOPS per watt than CPUs, contributing to energy-efficient HPC systems.
- Many top supercomputers rely heavily on GPUs for this reason.
- Resource accounting on clusters
- HPC schedulers often expose GPUs as explicit resources (e.g., --gpus or --gres=gpu: options).
- Jobs must request not only CPU cores and memory but also a certain number and type of accelerators.
- Fair use and scheduling policy may limit how many GPUs a single job can acquire.
- Memory limits
- Device memory capacity is often much smaller than system RAM.
- Applications must work within this limit or use techniques like domain decomposition and streaming to fit.
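Before choosing a chunk or tile size, applications often query how much device memory is actually available. A minimal CUDA sketch; the 80% headroom factor is an arbitrary illustration, not a recommendation:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Query device memory and size work to fit within it, leaving headroom
// for the runtime and libraries.
int main(void) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    size_t budget = (size_t)(0.8 * free_bytes);      // don't plan to use everything
    size_t elems_per_chunk = budget / sizeof(double);

    printf("GPU memory: %zu MiB free of %zu MiB total; chunk = %zu doubles\n",
           free_bytes >> 20, total_bytes >> 20, elems_per_chunk);
    return 0;
}
```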
Later chapters will connect these architectural aspects to job scheduling, application design, and performance optimization strategies specific to GPU-accelerated systems.