From CPUs to GPUs in HPC
In high performance computing, accelerators are processors that support the main CPU and are specialized for certain kinds of computation. The most important class of accelerators today is the Graphics Processing Unit, or GPU. While the parent chapter has already introduced general CPU concepts, here the focus is on how GPUs and related accelerators differ from CPUs and why this matters for HPC.
A CPU is optimized for low latency and flexible control flow. A GPU is optimized for high throughput on many similar operations in parallel. HPC systems often combine both. The CPU orchestrates the work and runs the operating system, while the GPU executes the numerically heavy, parallel parts of the application.
What is a GPU in the HPC context
GPUs were originally developed to render graphics. Modern GPUs expose their computational resources through general purpose programming models such as CUDA and OpenCL. In HPC, a GPU is treated as a massively parallel coprocessor attached to a CPU. It has its own cores, its own memory, and its own execution model.
In a typical node with GPUs, the CPU process allocates data, transfers it to GPU memory, launches a GPU kernel, and later transfers the results back. A kernel is a function that runs in parallel across a large number of GPU threads. These threads execute similar instructions on different data elements. This style of computing matches many scientific and engineering algorithms that involve large arrays or grids.
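As a minimal sketch, the following CUDA kernel scales every element of an array by a constant. The kernel name, array names, and launch configuration are illustrative rather than taken from any particular application; the point is that each GPU thread computes its own global index and handles one element.

    // Minimal CUDA kernel sketch: each thread scales one array element.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard: the grid may be larger than n
            data[i] *= factor;
    }

    // Host side launch: enough blocks of 256 threads to cover n elements,
    // assuming d_data already points to an array in GPU memory.
    // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);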
Many supercomputers in the Top500 list include GPUs because they provide high performance per watt and high performance per node. The shift from purely CPU based clusters to CPU plus GPU nodes is one of the defining trends in modern HPC.
Why GPUs excel at parallel workloads
The key idea behind GPU design is to trade control complexity for many more arithmetic units. A CPU core has sophisticated branch prediction, large caches, and strong single thread performance. A GPU has comparatively simpler control for each execution unit, but a very large number of such units. This allows it to keep thousands of threads in flight.
GPUs are particularly effective for data parallel workloads. In such workloads, the same operation is applied to many elements, for example computing forces between particles, updating each cell of a grid, or applying the same matrix operation to many rows or columns. Each GPU thread handles a small portion of the data, and the GPU hardware schedules these threads in groups to hide memory access latencies by switching among them.
The GPU achieves high throughput by combining multithreading with vector like execution. Internally, groups of threads execute in lockstep on the same instruction. If all threads in the group follow the same control path, the hardware runs efficiently. If they diverge, for example due to if statements where different threads take different branches, then some hardware lanes become idle. Writing efficient GPU code requires an understanding of these characteristics.
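To make divergence concrete, the short CUDA sketch below applies a threshold to each element; the kernel and variable names are illustrative. When threads in the same lockstep group disagree on the branch, the hardware executes both paths one after the other, with part of the group idle each time. For a branch as small as this one the compiler may avoid the cost through predication, but longer divergent branches follow the serialized pattern.

    __global__ void threshold(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Threads whose elements satisfy the condition take one path, the
            // others take the second path; if both occur within one lockstep
            // group, the two paths run one after the other.
            if (in[i] > 0.0f)
                out[i] = in[i];
            else
                out[i] = 0.0f;
        }
    }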
Important statement: GPUs are most efficient when many threads perform the same operations on different data with regular memory access patterns and minimal control flow divergence.
Types of accelerators used in HPC
While GPUs are the dominant type of accelerator in current HPC systems, they are part of a broader class.
Discrete GPUs are separate devices connected to the CPU over an interconnect such as PCIe or NVLink. They have their own device memory and compute cores, and the CPU and GPU communicate through explicit data transfers and kernel launches.
Integrated GPUs share a chip or memory subsystem with the CPU. They can reduce data movement overheads, but often have fewer resources than high end discrete accelerators.
In addition to GPUs, other accelerators appear in HPC:
Manycore accelerators provide a large number of simpler CPU like cores optimized for throughput.
FPGAs, or field programmable gate arrays, let the user configure the hardware logic to implement custom data paths.
Domain specific accelerators target particular workloads, such as tensor processing units for machine learning.
From the perspective of an HPC user, these devices share common themes. They are typically programmed through specialized APIs, they may require explicit data movement, and they provide performance advantages only when their strengths are matched to the structure of the algorithm.
CPU GPU interaction and offloading
In an HPC application that uses accelerators, the CPU remains the primary processor. It launches the program, handles input and output, and manages communication across the cluster. The GPU acts as an attached device. This division of labor is often called offloading.
The basic sequence of offloading looks like this. Data is prepared in main memory by CPU code. Some of this data is then transferred to device memory on the GPU. The program launches one or more kernels on the GPU. While the kernels run, the GPU works independently of the CPU, which is free to do other work. When the kernels finish, results are copied back to CPU memory, and the CPU continues with the next steps in the workflow.
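A minimal host side sketch of this sequence in CUDA might look as follows; the function and array names are placeholders, and error checking is omitted for brevity.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n);  // kernel as sketched earlier

    void run_on_gpu(float *host_data, int n)
    {
        float *d_data = nullptr;
        size_t bytes = n * sizeof(float);

        cudaMalloc(&d_data, bytes);                                    // allocate device memory
        cudaMemcpy(d_data, host_data, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU transfer

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_data, 2.0f, n);                   // launch kernel (asynchronous)

        cudaMemcpy(host_data, d_data, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU transfer, waits for the kernel
        cudaFree(d_data);                                              // release device memory
    }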
Some modern systems provide mechanisms for more integrated memory access. For example certain GPUs support unified virtual addressing so that CPU and GPU can address a common virtual memory space. However, even in these systems, performance often depends on understanding when data is physically moved and where it resides.
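As one illustration, CUDA offers managed (unified) memory, where a single allocation is visible from both sides. The sketch below is only illustrative; the runtime still migrates pages between CPU and GPU memory behind the scenes, so data placement continues to affect performance.

    void run_with_managed_memory(int n)
    {
        float *data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));     // one pointer usable on CPU and GPU

        for (int i = 0; i < n; ++i)                      // CPU initializes the data
            data[i] = 1.0f;

        scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);  // GPU works on the same pointer
        cudaDeviceSynchronize();                         // wait before the CPU reads the results

        cudaFree(data);
    }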
The offload model introduces an additional dimension of performance tuning. The arithmetic on the GPU may be extremely fast, yet the total time may be dominated by the time spent moving data between CPU and GPU. This leads directly to one of the most important rules for accelerator usage.
Important rule: To benefit from accelerators, maximize useful computation per byte of data transferred and avoid unnecessary CPU GPU data movement.
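As a rough, illustrative calculation: copying 1 GB across a PCIe link that sustains about 16 GB/s takes on the order of 60 ms. If the kernel that consumes this data runs for only a few milliseconds, the transfer dominates the total time, and the data should either stay resident on the GPU across many kernel launches or not be offloaded at all.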
Programming models for GPUs and accelerators
From the HPC perspective, accelerators are visible through specific programming models. These programming models differ in syntax and ecosystem, but share common concepts such as kernels, device memory, and explicit parallelism.
One major category is vendor specific low level APIs, such as CUDA for NVIDIA GPUs and HIP for AMD GPUs. These provide fine grained control over threads, blocks, and memory spaces, and are often used for performance critical kernels in scientific libraries.
Another category uses directive based models, where the programmer annotates existing CPU code with compiler directives (pragmas) and a compiler generates the appropriate device code and data transfer calls. Examples are OpenACC and the target offload directives in OpenMP. These models aim to simplify porting existing codes by preserving much of the original structure.
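As a hedged sketch of this style, the loop below is annotated with OpenMP target offload directives. The directive and clause names follow the OpenMP specification, while the function and array names are illustrative, and the achieved performance depends on the compiler and target device.

    // A plain C++ loop annotated for GPU offload with OpenMP directives.
    // The map clauses describe which arrays are copied to and from the device.
    void saxpy(int n, float a, const float *x, float *y)
    {
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }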
There are also portable abstractions and frameworks. Some libraries and programming models abstract over different accelerators and provide a unified interface. Examples include Kokkos, SYCL, and domain specific frameworks in areas such as linear algebra or particle simulations.
The choice of model affects not only programming effort but also portability across systems. In a heterogeneous HPC landscape, where different clusters may have different accelerators, portability is a significant concern. This topic appears again later when software stacks and performance portability are discussed in more detail.
Architectural considerations for performance
Accelerators differ from CPUs in several important architectural aspects that influence algorithm design and optimization.
First, the memory hierarchy on GPUs is distinct from the CPU memory hierarchy and typically optimized for throughput. There are fast on chip memories for a group of threads, and larger global device memory with higher latency. Effective GPU use often depends on arranging data and computations to exploit these faster memories and to minimize uncoalesced or irregular accesses to global memory.
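The contrast between coalesced and strided access can be sketched as two CUDA kernels; the names are illustrative. In the first, neighboring threads touch neighboring addresses and the hardware can combine their accesses into a few wide memory transactions; in the second, a large stride scatters the accesses and wastes much of the available bandwidth.

    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];              // neighboring threads read neighboring addresses
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];     // neighboring threads read far apart addresses
    }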
Second, GPUs are designed to execute a very high number of threads concurrently. This requires a high degree of exposed parallelism in the algorithm. If an algorithm contains long sequential dependencies, or if there are only a few independent tasks, then the GPU cannot be fully utilized.
Third, the GPU execution model favors regular control flow. Branch heavy code where many threads within an execution group take different branches will underutilize the hardware. Some algorithms must be reformulated to reduce divergence, for example by separating cases into different kernels or restructuring loops.
Fourth, accelerators often use specialized arithmetic units for particular operations. For example many GPUs have tensor cores or matrix cores that can perform small matrix operations extremely quickly. HPC codes that can leverage these units for dense linear algebra operations may gain substantial speedups, but may also need to adjust data layouts and precision choices.
Because accelerators are separate devices, node level performance is tied to three factors: GPU computational throughput, device memory bandwidth, and interconnect bandwidth between CPU and GPU. Efficient codes consider all three. This has practical implications for how often data is transferred, how data is laid out in memory, and how work is divided between CPU and GPU.
When accelerators help and when they do not
Not every HPC workload benefits equally from accelerators. Understanding when accelerators are suitable is part of the basic literacy for new HPC users.
Accelerators often shine in workloads that are compute intensive relative to their memory and communication costs. Examples include dense linear algebra, stencil computations with high arithmetic intensity, molecular dynamics force calculations, and many machine learning kernels. In such cases, the large number of GPU cores and their high floating point throughput can be well utilized.
Workloads with very irregular memory access patterns, heavy branching, or low arithmetic intensity may see smaller benefits or even slowdowns if naively ported. If a kernel performs only a small amount of computation per element and requires frequent synchronization or branching, the overheads of offloading and data movement may dominate. Some irregular algorithms can still gain from accelerators after careful redesign, but this often requires experience and experimentation.
Problem size matters. GPUs are designed to handle large amounts of parallel work. If a problem is too small, it may not generate enough threads to keep the GPU occupied. In such cases, the CPU may perform equally well or better because it avoids offloading overheads.
It is also important to consider development time and maintenance. Using accelerators can introduce additional complexity, especially when targeting multiple architectures. For some projects, the extra performance may justify this complexity. For others, relying on tuned libraries that already exploit accelerators may provide a good compromise between performance and programming effort.
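For example, dense matrix multiplication can be handed to a vendor tuned library instead of a hand written kernel. The sketch below uses cuBLAS and assumes square n by n matrices that already reside in device memory in column major order; names other than the cuBLAS calls themselves are placeholders.

    #include <cublas_v2.h>

    void gemm_on_gpu(cublasHandle_t handle, int n,
                     const double *d_A, const double *d_B, double *d_C)
    {
        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n,
                    &alpha, d_A, n, d_B, n,
                    &beta,  d_C, n);     // C = alpha * A * B + beta * C, computed on the GPU
    }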
The role of accelerators in modern supercomputers
The move toward exascale computing has placed accelerators at the center of large HPC systems. Many of the current top supercomputers use GPU accelerated nodes as building blocks. Each node contains a small number of CPUs and several GPUs, connected through high bandwidth interconnects, and nodes are linked across the cluster by high speed networks.
From the application perspective, this architecture encourages hierarchical parallelism. Some degree of parallelism is expressed at the cluster level, for example through MPI across nodes. Within a node, further parallelism is expressed across CPU cores and GPU devices. Within each GPU, fine grained parallelism is expressed across thousands of threads. Later chapters on hybrid parallel programming and accelerators describe how these layers interact in practice.
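A common, simplified sketch of how these layers meet in code is to give each MPI rank its own GPU on a multi GPU node. The pattern below assumes that the ranks placed on a node are numbered consecutively; production codes usually derive a node local rank rather than using the global rank directly.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0, ndev = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&ndev);
        cudaSetDevice(rank % ndev);      // each rank drives a different GPU on the node

        // ... per rank work: host code, kernel launches, MPI communication ...

        MPI_Finalize();
        return 0;
    }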
The emphasis on energy efficiency is another driver for accelerator based designs. GPUs often provide higher performance per watt than general purpose CPUs for suitable workloads. Since power and cooling are major constraints for large systems, accelerators are an attractive way to continue increasing performance without proportionally increasing energy consumption.
However, this shift also requires that HPC users become comfortable with heterogeneous computing. Applications that used to run entirely on CPUs may need to be refactored to exploit accelerators, or at least linked against libraries that do so internally. Understanding the basic concepts presented here will help you navigate those later topics.
Summary of key ideas about GPUs and accelerators
GPUs and other accelerators act as high throughput companions to CPUs in modern HPC systems. They are especially effective when many similar operations are performed on large data sets, in a way that exposes abundant fine grained parallelism and regular memory access patterns.
Accelerators are accessed through specific programming models and often require explicit data movement and kernel launches. Achieving good performance requires attention to arithmetic intensity, data locality between CPU and GPU, and the match between algorithm structure and device architecture.
In later chapters on GPU programming models and performance analysis, these architectural principles reappear in more concrete form. For now, the main takeaway is conceptual. CPUs provide flexible control and moderate parallelism, accelerators provide massive throughput for suitable workloads, and effective HPC applications combine both in a coordinated way.