GPU and Accelerator Computing

Introduction

Graphics Processing Units and other accelerators have transformed modern high performance computing. Where traditional CPU based systems rely on a relatively small number of very powerful cores, GPUs provide thousands of simpler cores that excel at processing large numbers of similar operations at once. In many scientific and engineering applications this model matches the structure of the computation very well, which leads to dramatic speedups and improved energy efficiency.

In this chapter the focus is on the conceptual role of GPUs and accelerators in HPC, how they differ from CPUs at a high level, and what this means for application design and usage on clusters. Detailed architectural aspects and programming models will be covered in the dedicated chapters on GPU architecture basics, GPU memory hierarchy, CUDA, OpenACC, and accelerator performance considerations.

From Graphics to General Purpose Acceleration

GPUs were originally designed to render images and animations for computer graphics. This task involves applying similar mathematical operations to many pixels, vertices, or fragments. Hardware designers discovered that they could achieve high throughput by building many simple arithmetic units that all execute the same instruction on different data. Over time, vendors exposed this capability for non-graphics tasks, which led to general purpose GPU computing.

In HPC, this evolution means that a GPU is no longer just a graphics device attached to a workstation. It is treated as a numerical accelerator that can run large parts of a scientific simulation, a machine learning workload, or a data analysis pipeline. When you run on an HPC cluster node with GPUs, the CPU is often responsible for overall control and coordination, while the GPU executes the most computationally intensive kernels.

The same trend created other accelerator types besides GPUs. These include many-core coprocessors, custom ASICs for machine learning, and field programmable gate arrays. From the perspective of an HPC user, they all share a similar pattern. They are separate devices that you must target explicitly with an appropriate programming model, and they offer high performance for certain workloads at the cost of more complex programming and data management.

CPU versus GPU in the HPC Node

On a typical GPU equipped compute node, the CPU and GPU have clearly separated roles. The CPU is a latency optimized processor that excels at sequential control flow, complex branching, and running the operating system. The GPU is a throughput optimized processor that excels at performing the same operations across large data sets.

For an HPC application, this leads to a control plus accelerator model. The main program runs on the CPU, allocates and manages memory, handles file I/O, communicates with other nodes through MPI, and invokes GPU kernels to perform intensive numerical work. This model is reflected in the programming interfaces that you will encounter later, such as CUDA and OpenACC.

At the node level, GPUs are typically attached to the CPU through an interconnect, for example PCI Express or a dedicated high bandwidth link. From the programmer’s perspective the GPU has its own memory space, which is distinct from the main system memory. This separation strongly influences how you structure your computations and data transfers. In later chapters you will see patterns like copying input arrays from CPU memory to GPU memory, launching a parallel computation kernel on the GPU, and then copying results back.
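
The following sketch illustrates this copy, compute, copy back pattern with the CUDA runtime API. The kernel and array names are illustrative, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative kernel: each GPU thread scales one array element.
__global__ void scale(const double *in, double *out, double factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = factor * in[i];
    }
}

int main() {
    const int n = 1 << 20;                 // illustrative problem size
    const size_t bytes = n * sizeof(double);

    // Host (CPU) memory.
    double *h_in  = (double *)malloc(bytes);
    double *h_out = (double *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0;

    // Device (GPU) memory lives in a separate address space.
    double *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    // 1. Copy the input from host memory to device memory.
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // 2. Launch the kernel: many threads execute scale() in parallel.
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_in, d_out, 2.0, n);

    // 3. Copy the results back to host memory.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("first element: %f\n", h_out[0]);

    cudaFree(d_in);  cudaFree(d_out);
    free(h_in);      free(h_out);
    return 0;
}
```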

An important conceptual rule: a GPU is not a faster drop-in replacement for a CPU. It is a different device that must be used explicitly, usually to accelerate specific kernels that have a lot of regular, data parallel work.

Accelerators in the Cluster Context

In an HPC cluster, GPUs and other accelerators are resources that you request through the job scheduler. Some nodes in the cluster will be CPU only, while others will have a fixed number of GPUs attached. Job scripts must specify the type and quantity of accelerator resources that are required, for example by requesting one or more GPUs per node.

This resource model has two consequences. First, you must design your application to use the accelerators effectively on each node that you are given. Second, you must combine accelerator usage with whatever distributed memory model the cluster uses, usually MPI across nodes. In practice this leads to hybrid patterns where each MPI process controls one or more GPUs and offloads work to them while also participating in inter node communication.
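
A minimal sketch of one such hybrid pattern is shown below, assuming one MPI process per GPU. The round robin mapping by global rank is a simplification; real setups often use the node local rank and scheduler provided environment variables.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind this MPI process to one of the GPUs visible on its node.
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    int my_device = rank % num_devices;    // simplified round robin mapping
    cudaSetDevice(my_device);

    printf("MPI rank %d of %d is using GPU %d\n", rank, size, my_device);

    // ... allocate device memory, offload the local part of the work to the
    // selected GPU, and exchange boundary or result data with other ranks
    // through the usual MPI calls ...

    MPI_Finalize();
    return 0;
}
```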

From a workflow perspective, accelerator nodes may be more limited in number than CPU only nodes, and they may be more heavily contended. Efficient accelerator use is therefore not only a technical challenge but also a practical one, because poor utilization of GPUs during a job wastes scarce and valuable resources.

Workloads that Benefit from GPUs

Not every application benefits from GPU acceleration. The most suitable workloads tend to share some key characteristics. They involve a very large number of arithmetic operations, often on floating point data. They have a high degree of data parallelism, where the same or similar operation is applied independently to many elements of an array, grid, or particle list. Their control flow is relatively regular, with limited branching that depends on each data element.

Examples in HPC include stencil computations on grids, dense linear algebra, particle simulations, many spectral methods, and especially modern deep learning training and inference. For these workloads GPUs can deliver order of magnitude speedups compared to CPUs, provided that the code is written to exploit the GPU’s parallelism and memory hierarchy.

Workloads with irregular memory access, complex branching, or very small problem sizes are usually a poor match. In such cases the cost of moving data to the accelerator and managing its execution can outweigh any raw speed advantage. As a result many HPC applications adopt a mixed strategy, where some components remain on the CPU while others are ported to the GPU.

A useful practical rule: focus GPU acceleration on kernels that have a high ratio of arithmetic operations to data movement, often summarized as a high arithmetic intensity. These kernels are the most likely to benefit from the GPU’s throughput capabilities.
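
As a rough illustration, arithmetic intensity can be written as the ratio of floating point operations to bytes moved between memory and the processor. The estimates below assume idealized double precision kernels and ignore caching.

```latex
% Arithmetic intensity: floating point operations per byte of data moved.
I = \frac{\text{floating point operations}}{\text{bytes moved}}

% Low intensity example: the update y_i \leftarrow a\,x_i + y_i performs
% 2 FLOPs per element while moving 24 bytes (read x_i, read y_i, write y_i):
I_{\text{axpy}} = \frac{2}{24} \approx 0.08 \ \text{FLOP/byte}

% High intensity example: dense n \times n matrix multiplication performs
% about 2n^3 FLOPs on about 3 \cdot 8\,n^2 bytes of matrix data:
I_{\text{gemm}} \approx \frac{2n^3}{24\,n^2} = \frac{n}{12} \ \text{FLOP/byte}
```

The first kernel is limited by memory bandwidth on essentially any processor, while the second has enough work per byte to benefit from a GPU's arithmetic throughput.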

Programming Models for GPU and Accelerator Computing

Using a GPU or other accelerator requires a programming model that can express offload of computations and data. The most widely used low level model for NVIDIA GPUs is CUDA, which extends C, C++, and Fortran with keywords and runtime calls to manage device memory and launch kernels. CUDA exposes details of the GPU execution model, which gives fine grained control but also requires more effort.

Directive based models, such as OpenACC or GPU related extensions to OpenMP, aim to reduce this effort. In these models you annotate existing CPU code with pragmas that inform the compiler which loops or regions should be offloaded to an accelerator. The compiler then generates the appropriate device code and data movement. This approach can reduce code modifications and allows a smoother transition from CPU only to heterogeneous execution.
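
The sketch below shows this directive based style on a plain C loop, using OpenACC data clauses to describe the required transfers. The clause choices are illustrative; a compiler with OpenACC support generates the device kernel and the copies.

```c
// Illustrative axpy-style loop annotated with OpenACC directives.
// copyin: transfer x to the device; copy: transfer y in and back out.
void scale_add(int n, double a, const double *restrict x, double *restrict y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```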

Other accelerators, such as AMD GPUs or Intel GPUs, have their own ecosystems, including vendor specific extensions and portable programming models like SYCL. In an HPC environment you will typically follow the programming model that is supported on the cluster you use. Later chapters will introduce CUDA and OpenACC in more detail and show how they are used on real systems.

Algorithmic Thinking for Accelerators

Effective accelerator usage requires a shift in algorithmic thinking. Instead of writing code that processes one element at a time in a sequential loop, you must consider how to express the computation as many concurrent operations. This might involve restructuring data layouts, breaking computations into kernels that the GPU can execute in parallel, and reducing dependencies between operations.

For example, a simple loop that updates each cell in a large array can be seen as a map-like operation, where the same update rule is applied to each cell independently. On a GPU this would naturally become a kernel in which each thread handles one cell. More complex patterns, such as reductions that compute a global sum or norm, require parallel algorithms that combine partial results from many threads in a hierarchical way.
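
The sketch below shows one classical way to express such a reduction in CUDA: each thread block combines its elements in shared memory, and a single atomic update merges the per block partial sums. It assumes the block size is a power of two and a GPU recent enough to support atomicAdd on doubles; production codes usually rely on library primitives instead.

```cuda
#include <cuda_runtime.h>

// Illustrative sum reduction. Each block reduces its chunk of the input in
// shared memory; thread 0 of each block then adds the partial sum atomically.
// Assumes blockDim.x is a power of two and *result is initialized to zero.
__global__ void sum_reduce(const double *in, double *result, int n) {
    extern __shared__ double partial[];

    const int tid = threadIdx.x;
    const int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element, or zero if it falls past the end.
    partial[tid] = (i < n) ? in[i] : 0.0;
    __syncthreads();

    // Tree-style combination of partial results within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // One thread per block contributes the block's partial sum.
    if (tid == 0) {
        atomicAdd(result, partial[0]);
    }
}

// Launch as: sum_reduce<<<blocks, threads, threads * sizeof(double)>>>(d_in, d_sum, n);
```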

You also need to be aware of the cost of moving data between CPU and GPU memory. Algorithms that repeatedly send small pieces of data back and forth between host and device tend to perform poorly. Instead, you aim to move data to the accelerator once, perform as many operations as possible on it there, and only transfer essential results back.

A central design principle: minimize host device data transfers and maximize the amount of computation performed per transferred byte. This principle often determines whether GPU acceleration yields a speedup or a slowdown.
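
A minimal sketch of this principle is shown below: the field stays in device memory across the whole time stepping loop, and only the final state is copied back. The update kernel is a hypothetical placeholder for the application's real computation.

```cuda
#include <cuda_runtime.h>
#include <utility>

// Hypothetical update kernel standing in for the application's real work;
// here it simply averages each interior value with its neighbors.
__global__ void step_kernel(const double *cur, double *next, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        next[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
    }
}

void run_simulation(double *h_field, int n, int num_steps) {
    const size_t bytes = n * sizeof(double);
    double *d_cur, *d_next;
    cudaMalloc(&d_cur, bytes);
    cudaMalloc(&d_next, bytes);

    // One upload before the loop ...
    cudaMemcpy(d_cur,  h_field, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_next, h_field, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    for (int step = 0; step < num_steps; ++step) {
        step_kernel<<<blocks, threads>>>(d_cur, d_next, n);
        std::swap(d_cur, d_next);   // intermediate states never leave the GPU
    }

    // ... and one download after it.
    cudaMemcpy(h_field, d_cur, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_cur);
    cudaFree(d_next);
}
```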

Numerical Precision and Accelerator Choices

GPU and accelerator hardware often supports multiple numerical precisions, such as double precision, single precision, and sometimes even lower precision formats. In HPC contexts, traditional simulations have relied heavily on double precision arithmetic, which offers high numerical fidelity. However, for some workloads, especially in machine learning, reduced precision can be sufficient and can offer significant performance gains.

When using accelerators, you must be aware that performance and hardware capabilities can vary by precision. Some GPUs deliver their highest floating point throughput in single precision or specialized formats used for deep learning. In contrast, double precision performance may be much lower. This can influence algorithm design and even the choice of hardware for a particular project.

The decision of which precision to use involves a tradeoff between performance and numerical accuracy. In some cases mixed precision methods are applied, where most arithmetic uses a lower precision to gain speed, but critical accumulations or final corrections use higher precision to control error growth. The suitability of such approaches is problem dependent and often requires careful testing and validation.
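
A minimal sketch of the mixed precision idea is shown below: the inputs are stored in single precision, while the products are promoted and accumulated in double precision to limit error growth. Whether such a scheme is acceptable is problem dependent and must be validated.

```cuda
// Mixed precision sketch: single precision storage, double precision accumulation.
double dot_mixed(const float *x, const float *y, int n) {
    double sum = 0.0;                          // higher precision accumulator
    for (int i = 0; i < n; ++i) {
        sum += (double)x[i] * (double)y[i];    // promote before accumulating
    }
    return sum;
}
```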

Portability and Vendor Ecosystems

One challenge in GPU and accelerator computing is the diversity of hardware and software ecosystems. Different vendors provide different driver stacks, compiler tools, and runtime libraries. Proprietary interfaces, such as vendor specific APIs, can lock an application to one hardware family. Portable models aim to mitigate this, but you must still consider what hardware is available in your target HPC environment.

From a practical standpoint, when developing for an HPC cluster you typically select the primary accelerator platform that the facility supports and use the recommended programming tools. You may then gradually introduce portable abstractions, such as higher level libraries or directive based models, in order to maintain some level of flexibility for future hardware transitions.

Portability also affects library usage. Many common numerical operations that you might otherwise implement by hand are available in optimized GPU libraries. Examples include dense linear algebra kernels, fast Fourier transforms, and random number generation. Using such libraries not only improves performance but also helps to shield your code from low level architecture details, improving its chances of running on future accelerators with fewer modifications.
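
As an example of the dense linear algebra case, the sketch below assumes NVIDIA's cuBLAS library and replaces a hand written matrix multiply kernel with a single library call. The matrices are assumed to be already resident in device memory and stored in column major order, as cuBLAS expects.

```cuda
#include <cublas_v2.h>

// Sketch: C = alpha * A * B + beta * C for square n x n matrices on the GPU.
// d_A, d_B and d_C point to device memory that has already been filled.
void gemm_on_gpu(const double *d_A, const double *d_B, double *d_C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0;
    const double beta  = 0.0;
    cublasDgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,   // no transposes
                n, n, n,                    // matrix dimensions m, n, k
                &alpha,
                d_A, n,                     // leading dimension of A
                d_B, n,                     // leading dimension of B
                &beta,
                d_C, n);                    // leading dimension of C

    cublasDestroy(handle);
}
```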

Energy Efficiency and Accelerator Use

Accelerators are often introduced into HPC systems not only for raw speed but also for energy efficiency. For many arithmetic intensive workloads, a GPU can perform more floating point operations per second per watt than a general purpose CPU. This improves both operating costs and the environmental footprint of large scale computing.

However, the potential efficiency gains are only realized if you keep the accelerators well utilized. A GPU that is allocated to a job but spends most of its time idle still consumes power without contributing useful work. Poorly optimized offload patterns, excessive data transfers, or incorrect job sizing can all reduce the effective performance per watt.

In practice, energy efficiency is therefore closely tied to performance optimization. The same changes that improve computational throughput and reduce unnecessary data movement often reduce energy usage as well. HPC centers increasingly monitor and report power consumption, and future systems may integrate energy aware scheduling and allocation policies that take accelerator usage into account.

Summary

GPU and accelerator computing introduces a heterogeneous model of computation in which CPUs orchestrate control flow and communication, while accelerators provide high throughput numerical processing for suitable workloads. This model has become central to modern HPC systems, especially in domains that can exploit large scale data parallelism.

To use accelerators effectively, you must identify appropriate kernels in your application, adopt suitable programming models, and structure your algorithms to maximize on device computation while minimizing data movement. Later chapters will examine GPU architecture details, memory hierarchies, CUDA and OpenACC programming, and performance strategies that build on the conceptual foundation introduced here.
