CUDA in the HPC Landscape
CUDA is a programming model and platform created by NVIDIA that allows you to write general-purpose programs that run on NVIDIA GPUs. In this chapter the focus is on the core ideas you need to start reading, understanding, and writing simple CUDA programs, not on mastering GPU performance tuning, which is covered later.
CUDA lets you write C, C++, Fortran, and Python code that explicitly launches work on a GPU. You express parallelism by writing functions that run in many threads in parallel. CUDA then maps these threads to the GPU hardware that was introduced in the GPU architecture basics chapter.
A useful mental model is that you write a program where the CPU, often called the host, coordinates work, while the GPU, often called the device, executes large numbers of parallel threads on data that you prepare for it.
CUDA is a vendor-specific, NVIDIA-only technology. CUDA code will not run on non-NVIDIA GPUs without translation layers or reimplementation.
Host and Device
In CUDA terminology the CPU and its memory are called the host, and the GPU and its memory are called the device. A typical CUDA program has both host code and device code.
Host code runs on the CPU. It performs tasks such as allocating memory on the host and on the device, copying data between them, launching GPU kernels, and collecting results.
Device code runs on the GPU. These are functions that execute in many parallel threads, each handling its own portion of data. In CUDA C or C++, such functions are marked with the __global__ qualifier (kernels launched from the host) or the __device__ qualifier (helper functions called from device code).
Data must be where the computation happens. If a kernel accesses data, that data normally lives in GPU memory. If CPU code accesses data, that data must be in host memory or be visible through a special mechanism. Moving data between host and device involves explicit memory copies, and these can dominate runtime if done carelessly.
The CUDA Programming Model
CUDA uses a Single Instruction, Multiple Thread (SIMT) execution model. You write a kernel, which is a function that describes the work for a single thread. When you launch the kernel, you specify how many parallel threads should run that function. The hardware schedules and groups these threads.
The hierarchy of parallelism has three layers that are visible in CUDA code. At the top level, you launch a grid. A grid is a collection of thread blocks that all execute the same kernel. Inside the grid, there are blocks, which are groups of threads. Each block executes on a single streaming multiprocessor and can use shared memory and synchronization within the block. Inside each block, there are threads. Each thread has its own registers, private local memory, and its own identifiers.
You will often use these identifiers in your kernel to compute which portion of the input a thread should work on. In CUDA C, the runtime provides built in variables such as blockIdx, blockDim, and threadIdx that let you find out where a thread is located in this hierarchy.
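As a quick illustration of these built in variables (the kernel name whoami and the 2 by 4 launch configuration are arbitrary choices for this sketch), a complete toy program can print each thread's position using device side printf:
#include <cstdio>

__global__ void whoami() {
    // Every thread prints its block and thread coordinates within the grid
    printf("block %d of %d, thread %d of %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main() {
    whoami<<<2, 4>>>();          // 2 blocks of 4 threads each
    cudaDeviceSynchronize();     // wait for the kernel so its output is flushed
    return 0;
}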
A First CUDA Kernel
The smallest useful CUDA example typically follows a pattern. On the host you allocate memory on both host and device, copy input data to the GPU, configure and launch a kernel, and finally copy the results back and free memory.
A minimal illustrative kernel that adds two vectors of length N can look like this in CUDA C:
__global__ void vector_add(const float *a, const float *b, float *c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        c[i] = a[i] + b[i];
    }
}
The __global__ keyword marks this as a kernel function that can be launched from the host and runs on the device. Inside the kernel, every thread computes its own index i by combining its block index and its thread index inside the block. The if (i < N) check avoids out of bounds access, because there are often more threads in the grid than elements in the data, especially when you round up the number of blocks.
On the host you would typically launch this kernel as follows:
int N = 1 << 20; // 1,048,576 elements
size_t size = N * sizeof(float);
// Allocate and initialize host arrays h_a, h_b, h_c (not shown)
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
vector_add<<<numBlocks, blockSize>>>(d_a, d_b, d_c, N);
// This copy on the default stream implicitly waits for the kernel to finish
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
The triple angle bracket syntax <<<numBlocks, blockSize>>> is CUDA specific kernel launch syntax. The first parameter configures the grid dimension (the number of blocks), the second the block dimension (the number of threads per block). In this example both are one dimensional; multi-dimensional configurations are expressed with the dim3 type.
CUDA Thread and Block Indexing
Understanding how to translate a position in the grid into data indices is a core skill in CUDA programming. In a one dimensional launch, as shown above, the computation of a global index is
$$
i = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}.
$$
When data is naturally two dimensional or three dimensional, for example in images or spatial grids, you can use two or three dimensional blocks and grids. CUDA provides blockIdx.y, blockIdx.z, threadIdx.y, threadIdx.z, and corresponding blockDim components.
For example, in a two dimensional grid, a thread might compute its 2D coordinates as
$$
i = \text{blockIdx.y} \times \text{blockDim.y} + \text{threadIdx.y},
$$
$$
j = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}.
$$
You can then map these to a linear index if the data is stored as a flat array. Choosing a convenient layout of threads often simplifies indexing and helps with memory access patterns, which are important for performance.
A correct mapping from thread and block indices to data indices is essential. Common bugs in CUDA kernels include out of bounds memory access caused by incorrect index calculations or missing bounds checks.
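As a sketch of a two dimensional launch (the kernel, the 16 by 16 block shape, and the device pointer d_img are illustrative; the image is assumed to be stored row by row in a flat array of width times height floats):
// Illustrative 2D kernel: scales every pixel of a width x height image.
// The bounds checks guard the partial blocks at the right and bottom edges.
__global__ void scale_image(float *img, int width, int height, float factor) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    if (i < height && j < width) {
        img[i * width + j] *= factor;                // map (i, j) to the linear index
    }
}

// Host side launch with a 16 x 16 block (a common but arbitrary choice here)
dim3 blockDim2d(16, 16);
dim3 gridDim2d((width + blockDim2d.x - 1) / blockDim2d.x,
               (height + blockDim2d.y - 1) / blockDim2d.y);
scale_image<<<gridDim2d, blockDim2d>>>(d_img, width, height, 2.0f);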
CUDA Memory Management Basics
From the host perspective, working with GPU memory involves explicit allocation and explicit copies. The basic functions in the CUDA Runtime API are:
cudaMalloc allocates memory on the device. You supply a pointer to a pointer and the number of bytes. The device pointer cannot be dereferenced in host code.
cudaFree frees device memory that was allocated with cudaMalloc.
cudaMemcpy copies data between host and device or between device regions. You must specify the source, destination, size, and the direction such as cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost.
In a typical workflow, you allocate host arrays using standard mechanisms in your language of choice, then allocate corresponding device buffers, copy data to the device, run kernels that operate on the device data, and finally copy back any results that the host needs.
Modern CUDA also supports unified memory where cudaMallocManaged allocates memory that is accessible by both host and device. The system handles migration between them. This can simplify programming at the cost of reduced explicit control. For beginners, it is important to understand the explicit memory model first, because it matches the underlying hardware reality and is closer to what performance oriented codes use.
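A minimal sketch of the unified memory variant of the vector addition example, reusing the vector_add kernel, numBlocks, and blockSize from above:
// Unified memory sketch: a, b, c are visible to both host and device
float *a, *b, *c;
cudaMallocManaged(&a, size);
cudaMallocManaged(&b, size);
cudaMallocManaged(&c, size);

for (int i = 0; i < N; ++i) {   // initialize directly from host code, no explicit copies
    a[i] = 1.0f;
    b[i] = 2.0f;
}

vector_add<<<numBlocks, blockSize>>>(a, b, c, N);
cudaDeviceSynchronize();        // wait before the host reads c; there is no copy to synchronize on

// c[0] ... c[N-1] now hold the results; the system migrated the data as needed
cudaFree(a);
cudaFree(b);
cudaFree(c);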
Compiling and Running CUDA Programs
CUDA programs in C or C++ are usually compiled with the NVIDIA CUDA compiler nvcc. The typical workflow is similar to compiling a C or C++ program, but nvcc handles both host and device code, and generates code for GPUs with specific compute capabilities.
For example, to compile a simple CUDA program stored in vecadd.cu you might use
nvcc -O2 -o vecadd vecadd.cu
You then run the resulting executable as any other program on a system with an appropriate NVIDIA GPU and drivers. On HPC clusters, CUDA applications are often compiled on login nodes where CUDA toolkits are provided through environment modules, and executed on GPU enabled compute nodes under a job scheduler.
It is important to match the CUDA toolkit version with the driver and GPU generation available on the cluster. This is normally handled by the system administrators, but users should be aware that CUDA code may fail to run if there is a mismatch.
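As a concrete illustration, the target GPU generation can be selected explicitly with the -arch option. The compute capability value and the module name below are assumptions that must be adapted to your cluster:
module load cuda                           # module name is site specific
nvcc -O2 -arch=sm_80 -o vecadd vecadd.cu   # sm_80 targets compute capability 8.0 hardware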
Kernels, Streams, and Asynchrony
Kernel launches in CUDA are asynchronous with respect to the host by default. This means that once you launch a kernel, the host code continues immediately while the kernel runs on the device. Similarly, some memory operations can be asynchronous.
CUDA provides streams to control this asynchrony. A stream is a sequence of operations that execute in order relative to each other, but operations in different streams can overlap. The default stream is used when you do not specify any stream explicitly.
In simple introductory programs you can ignore streams and rely on implicit synchronization points, such as cudaMemcpy between host and device or explicit calls such as cudaDeviceSynchronize. As you move toward more advanced usage and wish to overlap computation with data transfers, streams become important. For now, it is enough to recognize that kernel launches do not by themselves block the host.
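As a minimal sketch of what an explicit stream looks like (reusing the vector_add kernel and the buffers from the earlier example; for the asynchronous copies to actually overlap with other work, the host arrays generally need to be page locked, for example with cudaMallocHost):
cudaStream_t stream;
cudaStreamCreate(&stream);

// Operations in the same stream execute in order, but the host continues immediately
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(d_b, h_b, size, cudaMemcpyHostToDevice, stream);
vector_add<<<numBlocks, blockSize, 0, stream>>>(d_a, d_b, d_c, N);
cudaMemcpyAsync(h_c, d_c, size, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);   // block the host until all work in the stream is done
cudaStreamDestroy(stream);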
Asynchronous execution means that errors in kernels may not be detected immediately. Always check the result of CUDA API calls with cudaGetLastError or related mechanisms in real codes.
Thread Cooperation and Shared Memory
CUDA allows threads within the same block to cooperate directly. They can share data via shared memory and synchronize with each other. Shared memory is a fast, explicitly managed memory space local to a block.
A typical pattern is to copy a portion of global memory into shared memory, let all threads in the block work on it, and then write results back to global memory. To safely coordinate steps of such cooperation, CUDA provides a barrier function __syncthreads() which ensures that all threads in a block reach the same point before any of them proceed.
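A minimal sketch of this pattern (the kernel and the BLOCK_SIZE value are illustrative, and the launch is assumed to use exactly BLOCK_SIZE threads per block): each block stages its tile of the input into shared memory, synchronizes, and then writes the tile out in reversed order.
#define BLOCK_SIZE 256

__global__ void reverse_tiles(const float *in, float *out, int N) {
    __shared__ float tile[BLOCK_SIZE];          // one tile per block, visible to all its threads

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        tile[threadIdx.x] = in[i];              // stage global memory into shared memory
    }
    __syncthreads();                            // all loads must finish before any thread reads the tile

    int j = blockDim.x - 1 - threadIdx.x;       // partner position inside the tile
    int src = blockIdx.x * blockDim.x + j;      // its global index, used only for the bounds check
    if (i < N && src < N) {
        out[i] = tile[j];                       // write the reversed value back to global memory
    }
}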
Correct use of shared memory and synchronization is central to high performance kernels, but it also introduces potential for bugs such as deadlocks when not all threads in a block reach the barrier. For this introductory chapter it is enough to recognize that shared memory exists, that it is per block, and that it enables collaborative algorithms that are not possible with completely independent threads.
Error Handling and Basic Debugging
CUDA API functions return error codes. Ignoring these codes makes debugging very difficult, especially since execution is often asynchronous. A simple pattern is to check the return value of each CUDA call and abort or report an error if it is not cudaSuccess.
For kernels, error reporting works differently. Kernel launches can fail due to configuration errors or illegal memory accesses. You can detect some errors by calling cudaGetLastError after a kernel launch, and use tools such as cuda-memcheck, or its successor compute-sanitizer in recent toolkits, to detect invalid memory accesses.
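A common pattern is a small checking macro wrapped around every API call, combined with an explicit check after each kernel launch. The macro name CUDA_CHECK below is a convention used in this sketch, not part of the CUDA API:
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage with the earlier vector addition example:
CUDA_CHECK(cudaMalloc(&d_a, size));
vector_add<<<numBlocks, blockSize>>>(d_a, d_b, d_c, N);
CUDA_CHECK(cudaGetLastError());          // catches launch configuration errors
CUDA_CHECK(cudaDeviceSynchronize());     // surfaces errors that occur during kernel execution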
For a first contact with CUDA, you should at least follow these practices. Check every cudaMalloc, cudaMemcpy, and kernel launch for errors, keep kernel code simple and well commented, and test kernels on small data sizes where you can compare against a serial CPU implementation.
In CUDA, unchecked errors can silently corrupt results. Always validate CUDA calls and compare GPU results with a known correct CPU version when possible.
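As a sketch of such a comparison for the vector addition example (the tolerance of 1e-5 is an arbitrary choice, and the usual <cstdio> and <cmath> headers are assumed):
// Compare the GPU result h_c against a serial CPU reference computed on the fly
int errors = 0;
for (int i = 0; i < N; ++i) {
    float expected = h_a[i] + h_b[i];
    if (fabsf(h_c[i] - expected) > 1e-5f) {
        ++errors;
    }
}
printf("%s (%d mismatches)\n", errors == 0 ? "PASSED" : "FAILED", errors);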
Higher Level CUDA Ecosystem
Beyond the low level CUDA C interface there exists a large ecosystem of higher level tools and libraries. Many users never write kernels directly, but use libraries such as cuBLAS for linear algebra, cuFFT for Fourier transforms, and cuDNN for deep learning primitives. These libraries are highly optimized and hide most of the details of the programming model.
Bindings and wrappers also exist for languages like Python through projects such as Numba, CuPy, and PyTorch. They allow you to target CUDA GPUs without writing C or C++, which can be attractive for rapid development.
In the context of HPC, understanding the underlying CUDA model is valuable even if you mostly rely on libraries. It helps you choose sensible problem sizes, data layouts, and workflows, and it provides the basis for custom kernels when library routines are not sufficient.
Summary
CUDA exposes GPU computing through an explicit programming model that separates host and device, and that expresses parallel work as kernels executed by many threads organized into blocks and grids. A basic CUDA program performs explicit device memory management, launches kernels with appropriate configuration, and handles asynchrony and errors correctly.
For beginners, the main goals are to become comfortable with the host device split, to understand thread and block indexing, and to compile and run simple kernels. Deeper performance topics, such as careful tuning of memory access patterns, use of shared memory for performance, and advanced asynchrony with streams, build on these fundamental concepts and are treated elsewhere in the course.