Big Picture: Why This Chapter Matters
Modern high‑performance computing is no longer just about CPUs. A huge fraction of the world’s fastest supercomputers get their performance from GPUs and other accelerators. Even if you never write GPU code yourself, you will almost certainly:
- Run software that uses GPUs internally,
- Need to request GPU resources in a job script,
- Interpret performance results affected by accelerators.
This chapter gives you the conceptual toolkit to understand what GPUs/accelerators are doing and how they fit into an HPC system. Later chapters (e.g., CUDA, OpenACC) will cover programming them in more detail.
What Are GPUs and Accelerators in HPC?
In HPC, a GPU (Graphics Processing Unit) is used as a general‑purpose accelerator: a specialized processor designed to execute many simple operations in parallel, extremely quickly. GPUs were originally built for graphics, but their architecture is ideal for many scientific and data‑intensive workloads.
More generally, an accelerator is any device added to a node to speed up specific types of computation, offloading work from the CPU. Common accelerator types in HPC include:
- GPUs (NVIDIA, AMD, Intel)
- Vector engines (e.g., NEC SX‑Aurora)
- AI accelerators / TPUs
- FPGAs (field‑programmable gate arrays) in some specialized systems
In a typical HPC node:
- The CPU manages the operating system, I/O, and overall control.
- One or more accelerators handle the heavy, parallelizable numerical work.
You will often see node descriptions like:
2 × 32-core CPUs + 4 × GPUs + 512 GB RAM
Understanding this balance between CPUs and accelerators is central to using these nodes effectively.
CPU vs GPU: Conceptual Differences
From an HPC user’s perspective, the key difference is many simple cores vs fewer complex cores:
- CPU:
- A small number of powerful, latency‑optimized cores.
- Good at complex logic, branching, and sequential tasks.
- Designed to run any kind of code reasonably well.
- GPU:
- Hundreds to thousands of simpler, throughput‑optimized cores.
- Best at performing the same kind of operation on many data elements.
- Struggles with highly branched, irregular control flow.
This makes GPUs ideal for data parallel workloads where your computation can be expressed as doing similar operations over large arrays, grids, or batches of data.
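To make this concrete, the sketch below (in CUDA C++, which a later chapter covers properly) shows the same array-scaling operation written once as a sequential CPU loop and once as a GPU kernel in which each thread handles a single element. The function names are illustrative, not taken from any particular library.

```cuda
// CPU version: one core walks through the array element by element.
void scale_cpu(float *x, float a, int n) {
    for (int i = 0; i < n; ++i)
        x[i] = a * x[i];
}

// GPU version: each of many thousands of threads updates one element.
__global__ void scale_gpu(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the last block may extend past the array
        x[i] = a * x[i];
}
```

The per-element work is trivial in both cases; the GPU wins only because it can apply that work to very many elements at once.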
The Host–Device Model
Most current HPC systems use a heterogeneous node model:
- Host: the CPU (and its memory).
- Device: the GPU/accelerator (and its own memory).
Key points of the host–device model:
- The CPU (host) launches and controls work on the accelerator (device).
- The accelerator usually has its own memory (e.g., GPU memory or HBM).
- Data often needs to be copied between host and device memory.
- Performance can be limited not only by compute, but by data movement.
Common memory/connection terms you will encounter:
- PCIe: A standard bus used to connect GPUs/FPGAs to CPUs.
- NVLink, Infinity Fabric, etc.: Faster interconnects between GPUs and sometimes between CPU and GPU.
- Unified memory: A programming abstraction that hides explicit copies, though the underlying hardware may still move data.
Understanding when and how data moves between CPU and GPU is crucial for both performance and correct job configuration.
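The following is a minimal sketch of that choreography using the CUDA runtime API: the host allocates device memory, copies the input across the interconnect, launches a kernel, and copies the result back. The kernel and variable names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Illustrative kernel: scale each element by a constant.
__global__ void scale_gpu(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_x(n, 1.0f);            // host (CPU) memory
    float *d_x = nullptr;

    cudaMalloc(&d_x, n * sizeof(float));        // allocate device (GPU) memory
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);         // host -> device transfer

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale_gpu<<<blocks, threads>>>(d_x, 2.0f, n);   // work runs on the device

    cudaMemcpy(h_x.data(), d_x, n * sizeof(float),
               cudaMemcpyDeviceToHost);         // device -> host transfer
    cudaFree(d_x);
    return 0;
}
```

Both cudaMemcpy calls cross the CPU–GPU interconnect; for small kernels these transfers can easily cost more than the computation itself.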
Where GPUs Fit in an HPC Cluster
In a cluster, not all nodes necessarily have GPUs:
- CPU‑only nodes: traditional general‑purpose compute nodes.
- GPU nodes / accelerator nodes: equipped with one or more GPUs or other accelerators.
You might see cluster partitions or queues like:
- cpu, standard, or short — CPU‑only nodes
- gpu, gpu-long, a100, mi250 — GPU nodes, possibly with specific hardware types
From a workflow perspective:
- You log in to a login node (usually CPU‑only).
- You submit a job to a GPU partition.
- The scheduler allocates you a GPU node with the requested number of GPUs.
- Your job’s executable:
- Runs on the CPU, and
- Offloads selected kernels or loops to GPUs.
How many GPUs you can request and how you specify them depends on the cluster’s job scheduler configuration, which is covered in the job scheduling chapter.
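Once the scheduler has placed your job on a GPU node, your program (or the GPU-aware library it calls) can query which devices it was actually given. A minimal CUDA sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);                 // GPUs visible to this job
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        std::printf("GPU %d: %s, %.1f GiB device memory\n",
                    d, prop.name,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```

On most clusters, the devices visible here are only those the scheduler allocated to your job, typically enforced through an environment variable such as CUDA_VISIBLE_DEVICES or through cgroups.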
Types of Accelerated Workloads
Certain kinds of problems map particularly well to accelerators. Common classes include:
- Dense linear algebra:
- Matrix–matrix operations, solvers, factorizations.
- Many BLAS/LAPACK‑like operations have GPU‑accelerated implementations.
- Stencil and grid‑based methods:
- PDE solvers, CFD, climate models, structured meshes.
- Regular grid computations over large arrays (a short kernel sketch follows this list).
- Particle and agent‑based methods:
- Molecular dynamics, N‑body simulations.
- Many similar calculations for each particle/agent.
- Signal and image processing:
- FFTs, convolutions, filters, image transformations.
- Machine learning / AI:
- Training and inference for deep neural networks.
- Matrix multiplications and convolutions at massive scale.
- Monte Carlo and other embarrassingly parallel tasks:
- Large numbers of independent or weakly‑coupled simulations.
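As a taste of the grid-based class, here is a hedged sketch of a 1D three-point stencil kernel: every interior grid point applies the same small formula to its neighbours, which is exactly the regular, data-parallel structure GPUs handle well. Real solvers add boundary handling, tiling, and halo exchanges.

```cuda
// One GPU thread updates one interior grid point from its two neighbours.
__global__ void stencil_1d(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}
```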
If your workload can be expressed as the same operations over large data sets with minimal dependencies between elements, it is a strong candidate for acceleration.
When GPUs May Not Help
Accelerators are not beneficial for every workload. Common cases where GPUs struggle:
- Strongly sequential algorithms with little parallelism.
- Highly irregular memory access patterns.
- Branch‑heavy code with lots of conditionals and divergent behavior.
- Very small problems:
- Overhead of offloading and data movement can dominate runtime.
- I/O‑bound workloads:
- Jobs dominated by reading/writing large files rather than computing.
In such cases, CPU‑only execution may be simpler and faster.
Performance Considerations at a High Level
Later chapters will go into optimization details. At this stage, it is important to know the main performance levers conceptually:
- Arithmetic intensity:
- Ratio of computation (floating‑point operations) to data movement (bytes).
- GPUs shine when arithmetic intensity is high: many operations per byte moved (see the worked example after this list).
- Data movement costs:
- Moving data between CPU and GPU memory can be expensive.
- Good GPU codes minimize host–device transfers and reuse data on the device.
- Occupancy / parallelism:
- GPUs want a lot of parallel work.
- You need enough independent work units (threads, data items) to keep the device busy.
- Load balance:
- Across GPUs in a node and across nodes in the cluster.
- Imbalanced workloads can leave some GPUs idle while others are overloaded.
- Mixed precision:
- Many accelerators offer much higher performance at reduced precision (e.g., FP16 or bfloat16), often through dedicated matrix units such as Tensor Cores.
- HPC codes increasingly exploit lower precision for parts of the computation where accuracy is less critical.
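To make the first lever, arithmetic intensity, concrete, here is a worked count for a simple AXPY update (a sketch assuming 4‑byte floats and ignoring caching effects):

```cuda
// y[i] = a * x[i] + y[i]
// Per element: 2 FLOPs (one multiply, one add)
//              12 bytes moved (load x[i], load y[i], store y[i])
// Arithmetic intensity = 2 / 12 ≈ 0.17 FLOP/byte  -> memory-bound on any GPU.
//
// By contrast, an N x N dense matrix multiply does ~2*N^3 FLOPs on ~3*N^2
// values, so its intensity grows with N -> compute-bound, ideal for GPUs.
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```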
These ideas will appear repeatedly when analyzing and tuning performance.
How GPU Programming Fits into the Software Stack
Accelerators are used in practice through several layers:
- Low‑level programming models (covered later):
- CUDA (NVIDIA), HIP (AMD), SYCL/oneAPI (Intel and others).
- Direct control but more code complexity and vendor specificity.
- Directive‑based models:
- OpenACC, OpenMP offloading.
- Add annotations (#pragma directives) to existing CPU code to offload parts to accelerators.
- High‑level libraries and frameworks:
- Linear algebra libraries (e.g., cuBLAS, rocBLAS, oneMKL).
- FFT libraries (e.g., cuFFT, rocFFT).
- Machine learning frameworks (e.g., PyTorch, TensorFlow).
- Domain‑specific packages (e.g., GPU‑enabled molecular dynamics, CFD).
In many real‑world HPC workflows, you may never write CUDA or OpenACC yourself. Instead, you:
- Use prebuilt GPU‑aware libraries,
- Enable GPU support in existing applications, or
- Choose software builds compiled for specific accelerators.
Understanding which parts of your workflow are GPU‑accelerated and which are not is important for interpreting runtime and scaling behavior.
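As an illustration of the "prebuilt GPU‑aware library" route, the sketch below performs a matrix multiply on the GPU through cuBLAS rather than a hand-written kernel. It assumes the matrices already live in device memory (as in the host–device example earlier) and omits error checking; the helper function name is illustrative.

```cuda
#include <cublas_v2.h>

// C = A * B for column-major matrices already resident in GPU memory:
// A is m x k, B is k x n, C is m x n.
void gemm_on_gpu(const float *d_A, const float *d_B, float *d_C,
                 int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, m,    // leading dimension of A
                        d_B, k,    // leading dimension of B
                &beta,  d_C, m);   // leading dimension of C
    cublasDestroy(handle);
}
```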
Practical Aspects of Using Accelerators in HPC
From a user standpoint, several practical issues arise when working with accelerators:
- Requesting GPUs in job scripts:
- You typically specify:
- Number of GPUs per node (e.g., 1, 2, 4…)
- Possibly GPU type (e.g., a100, v100, mi250)
- Resource requests affect which nodes the scheduler can assign.
- Matching software to hardware:
- Some builds are specific to GPU vendors or generations.
- Using the wrong build (or an unsupported driver) can lead to poor performance or failures.
- Memory constraints:
- GPU memory is often much smaller than system RAM.
- You need to be aware of per‑GPU memory limits when choosing problem sizes.
- Multi‑GPU jobs:
- A single node may have several GPUs; a cluster may have many such nodes.
- Codes may use:
- One GPU per MPI rank,
- Shared GPUs among threads,
- Hybrid MPI+GPU allocations.
- Error handling and robustness:
- GPU jobs can fail due to:
- Out‑of‑memory on the device,
- Driver/runtime errors,
- Using incompatible software stacks.
- Understanding logs and error messages is important for debugging.
These operational details tie into job scheduling, software stacks, and debugging chapters, where you’ll see concrete examples.
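Because most GPU runtimes report failures through status codes rather than exceptions, robust codes check every call. A minimal CUDA sketch (the macro name is illustrative) that also queries the per-GPU memory limit mentioned above:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: abort with a readable message if a CUDA call fails.
#define CHECK_CUDA(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                         __FILE__, __LINE__, cudaGetErrorString(err));     \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    CHECK_CUDA(cudaMemGetInfo(&free_bytes, &total_bytes));   // device memory left
    std::printf("GPU memory: %.1f of %.1f GiB free\n",
                free_bytes / 1073741824.0, total_bytes / 1073741824.0);

    float *d_buf = nullptr;
    CHECK_CUDA(cudaMalloc(&d_buf, size_t(1) << 30));  // fails cleanly if too large
    CHECK_CUDA(cudaFree(d_buf));
    return 0;
}
```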
How GPUs Interact with Parallel Programming Models
HPC codes often combine multiple layers of parallelism:
- MPI across nodes (distributed memory).
- Shared‑memory threading (e.g., OpenMP) within a node.
- GPU parallelism on each accelerator.
Common patterns include:
- MPI rank per GPU:
- Each rank controls one GPU.
- Good for codes that were already MPI‑parallel.
- Hybrid MPI + threads + GPUs:
- Fewer MPI ranks per node, each with multiple CPU threads and access to multiple GPUs.
- Balances communication overhead with GPU utilization.
- Task‑based or asynchronous execution:
- CPU prepares data, launches GPU work, continues with other tasks while GPU runs.
- Overlaps computation and communication.
The choice of pattern affects both performance and scalability. Later chapters on hybrid programming and performance optimization will revisit these designs.
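The "MPI rank per GPU" pattern is commonly set up by giving each rank the device matching its position within the node, as in this hedged sketch (it assumes at least one GPU is visible on every node):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // Determine this rank's position among the ranks sharing the same node.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);

    // Bind this rank to one of the GPUs on the node.
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    cudaSetDevice(local_rank % num_gpus);

    // ... each rank now runs its share of the work on "its" GPU ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```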
Emerging Trends in Accelerator‑Based HPC
Accelerator technologies evolve rapidly. Some important trends relevant for future‑proofing your skills:
- Vendor diversity:
- NVIDIA, AMD, and Intel GPUs coexist in modern HPC systems.
- Portable programming models (e.g., SYCL, OpenMP offload) aim to reduce vendor lock‑in.
- Domain‑specific accelerators:
- AI‑focused devices, matrix engines, and tensor cores.
- Increasing integration into scientific applications.
- Tighter integration with CPUs:
- Unified or shared memory models,
- On‑package or stacked memory,
- Heterogeneous chips combining CPU and GPU cores.
- Energy efficiency focus:
- GPUs deliver more floating‑point performance per watt than CPUs for many workloads.
- Accelerator usage is a key part of green HPC strategies.
Keeping an eye on how your field’s software evolves with these trends will help you choose appropriate systems and tools for your work.
Summary
In this chapter, you’ve seen GPUs and accelerators as:
- Specialized processors that complement CPUs in HPC nodes.
- Extremely effective for highly parallel, data‑intensive workloads.
- Integrated into clusters as dedicated GPU nodes or accelerator partitions.
- Accessible via low‑level APIs, directives, and high‑level libraries.
- Central to current and future trends in performance and energy‑efficient computing.
Later chapters will dive into GPU architecture details, specific programming models (CUDA, OpenACC), and techniques for getting the best performance out of accelerator‑based systems.