What CUDA Is (and What It Is Not)
CUDA (Compute Unified Device Architecture) is NVIDIA’s programming model and toolchain for writing general-purpose code that runs on NVIDIA GPUs.
Key points:
- CUDA is:
- A C/C++/Fortran extension plus a runtime API for GPU programming.
- A way to write “kernels” that execute in parallel on many GPU threads.
- A set of libraries (cuBLAS, cuFFT, etc.) tuned for NVIDIA GPUs.
- CUDA is not:
- A general GPU standard (it is NVIDIA-specific).
- Automatically portable to non-NVIDIA GPUs without changes.
This chapter focuses on foundational, practical ideas: the CUDA execution model, basic syntax, and the typical workflow; performance tuning and advanced features are left for later chapters.
Host vs Device
CUDA explicitly separates:
- Host: The CPU and its memory.
- Device: The GPU and its memory.
A typical CUDA program:
- Runs on the host (ordinary C/C++/Fortran code).
- Allocates memory on the GPU.
- Copies data from host to device.
- Launches a kernel on the GPU.
- Copies results back from device to host.
- Frees GPU memory and exits.
Conceptually:
- Host code is “normal” sequential (or CPU-parallel) code.
- Device code is written as kernels that run many threads in parallel.
The CUDA Execution Model: Threads, Blocks, and Grids
CUDA provides a hierarchical way to organize parallel work:
- Thread: The smallest unit of execution on the GPU (executes kernel code).
- Block: A group of threads that:
- Can cooperate via fast shared memory.
- Can synchronize with __syncthreads().
- Grid: A collection of blocks that execute one kernel launch.
You specify how many threads and blocks to use when launching a kernel. CUDA supports 1D, 2D, and 3D layouts for both threads and blocks, which helps with mapping to multidimensional data (e.g., images, matrices).
Illustration (conceptual):
- Grid (many blocks)
- Block 0 (many threads)
- Block 1 (many threads)
- …
The right choice of block size and grid size affects performance, but at this level you mainly need to understand how to compute indices from these structures.
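To make the hierarchy concrete, here is a minimal sketch (the kernel name where_am_i is purely illustrative, not part of CUDA) in which every thread prints its position; launching it with 2 blocks of 4 threads produces 8 lines of output:
#include <stdio.h>
#include <cuda_runtime.h>
// Each thread reports where it sits in the grid/block hierarchy.
__global__
void where_am_i() {
    printf("block %d, thread %d (blockDim.x = %d, gridDim.x = %d)\n",
           blockIdx.x, threadIdx.x, blockDim.x, gridDim.x);
}
int main() {
    where_am_i<<<2, 4>>>();   // 2 blocks x 4 threads each
    cudaDeviceSynchronize();  // wait so the device-side printf output appears
    return 0;
}
The order of the printed lines is not deterministic, which is itself a useful reminder that the threads run concurrently.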
Basic CUDA Program Structure
A minimal pattern in C/C++:
- Define a kernel (runs on GPU):
- Marked with __global__ (called from host, runs on device).
- Each thread executes the same function body, but with different indices.
- Write host code (runs on CPU):
- Allocates and initializes host data.
- Allocates device memory with cudaMalloc.
- Copies data to the device with cudaMemcpy.
- Launches the kernel with the special <<<...>>> syntax.
- Copies results back with cudaMemcpy.
- Frees device memory.
Example: vector addition C[i] = A[i] + B[i]
#include <stdio.h>
#include <cuda_runtime.h>
// 1. Kernel: runs on the GPU
__global__
void vector_add(const float *A, const float *B, float *C, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x; // global index
if (idx < N) {
C[idx] = A[idx] + B[idx];
}
}
int main() {
int N = 1 << 20; // 1M elements
size_t size = N * sizeof(float);
// 2. Allocate and initialize host memory
float *h_A = (float*)malloc(size);
float *h_B = (float*)malloc(size);
float *h_C = (float*)malloc(size);
for (int i = 0; i < N; ++i) {
h_A[i] = 1.0f;
h_B[i] = 2.0f;
}
// 3. Allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// 4. Copy data from host to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// 5. Launch kernel: choose grid and block sizes
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vector_add<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Wait for GPU to finish (and check for errors in real code)
cudaDeviceSynchronize();
// 6. Copy results back to host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// 7. Verify a few results
printf("C[0] = %f (expected 3.0)\n", h_C[0]);
// 8. Free device and host memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
return 0;
}
Core ideas to notice:
- __global__ marks a kernel.
- Launch syntax: kernel<<<numBlocks, threadsPerBlock>>>(...).
- Use blockIdx, blockDim, and threadIdx to compute a unique index per thread.
- Transfers between host and device are explicit.
Thread and Block Indexing
Within a kernel, CUDA provides built-in variables:
- threadIdx: the thread’s index within its block.
- blockIdx: the block’s index within the grid.
- blockDim: the number of threads per block.
- gridDim: the number of blocks in the grid.
For a 1D configuration:
- Global linear index:
$$ \text{idx} = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x} $$
For 2D and 3D layouts, you extend the idea using .x, .y, .z. Example for 2D:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
You typically check bounds (if (idx < N)) because, for convenience, the total number of launched threads is often slightly larger than the number of data elements.
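Putting the 2D indices and the bounds check together, a common pattern (shown here as a hypothetical copy_2d kernel for a row-major width × height array) is:
__global__
void copy_2d(const float *in, float *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height) {                   // guard the extra threads
        int idx = y * width + x;                     // flatten 2D -> 1D (row-major)
        out[idx] = in[idx];
    }
}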
Memory Management Basics in CUDA
At this introductory level, you only need a few core functions from the CUDA Runtime API:
- Allocate on device: cudaMalloc((void**)&ptr, size);
- Free device memory: cudaFree(ptr);
- Copy data:
- Host → Device: cudaMemcpy(d_ptr, h_ptr, size, cudaMemcpyHostToDevice);
- Device → Host: cudaMemcpy(h_ptr, d_ptr, size, cudaMemcpyDeviceToHost);
- Device → Device: cudaMemcpy(d_dst, d_src, size, cudaMemcpyDeviceToDevice);
These transfers can be relatively slow compared to on-GPU computation, so even simple examples should aim to minimize unnecessary copies.
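As a sketch of that principle, reusing the vector_add kernel and the host pointers, device pointers, and sizes from the earlier example, together with a hypothetical vector_scale kernel, you can chain work on device data and copy back only the final result:
// Copy the inputs to the device once.
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Chain kernels on device data; no host round-trips in between.
vector_add<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
vector_scale<<<blocksPerGrid, threadsPerBlock>>>(d_C, 0.5f, N);  // hypothetical second kernel
// Copy only the final result back to the host.
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);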
Kernel Launch Configuration
You control how much parallelism you expose via:
- dim3 threadsPerBlock(...);
- dim3 blocksPerGrid(...);
Example: process a 2D image of size width × height:
dim3 threadsPerBlock(16, 16);
dim3 blocksPerGrid(
(width + threadsPerBlock.x - 1) / threadsPerBlock.x,
(height + threadsPerBlock.y - 1) / threadsPerBlock.y
);
image_kernel<<<blocksPerGrid, threadsPerBlock>>>(...);
Practical guidelines for beginners:
- Use 128–1024 threads per block, often multiples of 32.
- Compute the number of blocks so that blocks * threads >= total elements.
Fine-tuning these choices for performance belongs in more advanced material; here you mainly need to understand the relationship between data size and launch parameters.
Compiling and Running CUDA Programs
CUDA programs are compiled with NVIDIA’s compiler nvcc:
- Basic compilation: nvcc -o vector_add vector_add.cu
- .cu is the conventional file extension for CUDA source files.
- nvcc wraps a host C/C++ compiler (e.g., gcc) and adds the device compilation step.
- On an HPC cluster:
- You often need to load a CUDA module or environment first, e.g. module load cuda (details depend on your system’s module setup).
- Executables are typically run on nodes with GPUs via the job scheduler.
- You must request GPUs in your job script (covered in scheduler-related chapters).
Example interactive run (if a GPU is available on the current node):
./vector_add
Basic Error Handling
CUDA API calls and kernel launches can fail. At minimum you should:
- Check return values of CUDA runtime calls:
cudaError_t err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
return 1;
}
- After a kernel launch, check for launch errors and synchronize:
vector_add<<<blocksPerGrid, threadsPerBlock>>>(...);
cudaError_t errSync = cudaGetLastError();        // launch/configuration errors (reported synchronously)
cudaError_t errAsync = cudaDeviceSynchronize();  // errors during kernel execution (reported asynchronously)
if (errSync != cudaSuccess) {
fprintf(stderr, "Sync error: %s\n", cudaGetErrorString(errSync));
}
if (errAsync != cudaSuccess) {
fprintf(stderr, "Async kernel error: %s\n", cudaGetErrorString(errAsync));
}
This minimal pattern makes debugging much easier, especially on clusters where you may not have direct interactive access to the GPU node outputs.
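Many codes wrap this check in a small helper macro so that every runtime call is verified on one line; a minimal sketch (the name CUDA_CHECK is a common convention, not part of the CUDA API) might look like:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
// Report the failing file/line and abort if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                   \
                    __FILE__, __LINE__, cudaGetErrorString(err_));         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)
// Usage:
// CUDA_CHECK(cudaMalloc((void**)&d_A, size));
// CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));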
Typical First CUDA Tasks
For practice and to build intuition, beginners often start with:
- Elementwise operations:
- Vector addition, scaling, simple transformations.
- Map-like operations:
- Applying the same function to each array element.
- Simple 2D image operations:
- Inversion, thresholding, or basic filters.
These are all embarrassingly parallel and map naturally to the CUDA execution model: one thread per element (or pixel), with trivial indexing and limited control flow.
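For instance, a map-like kernel that applies the same function to every element differs from vector_add only in its body; the sketch below (with a hypothetical sqrt_each kernel) illustrates the pattern:
#include <math.h>
// Map-like kernel: apply sqrtf to every element of the input array.
__global__
void sqrt_each(const float *in, float *out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        out[idx] = sqrtf(in[idx]);
    }
}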
When to Use CUDA in an HPC Context
Within an HPC workflow, CUDA is particularly suitable when:
- You have a large amount of data-parallel work:
- The same operation repeated over many elements.
- The computation per data element is substantial enough to:
- Hide data transfer overhead between CPU and GPU.
- You are working on NVIDIA-based GPU clusters:
- Which is common in many HPC environments.
If your work is already covered by well-optimized CUDA-based libraries (e.g., cuBLAS for dense linear algebra, cuFFT for FFTs), it is often better to call those libraries rather than writing your own kernels from scratch.
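For example, the update y = alpha * x + y can be done with a single cuBLAS call rather than a hand-written kernel; a minimal sketch (assuming d_x and d_y are device arrays of n floats that have already been allocated and filled) is:
#include <cublas_v2.h>
// y = alpha * x + y on the device, using cuBLAS instead of a custom kernel.
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 2.0f;
cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // stride 1 for contiguous arrays
cublasDestroy(handle);
When compiling, link against the library (e.g. add -lcublas to the nvcc command).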
Summary
In this introduction, you saw:
- The conceptual split between host (CPU) and device (GPU) in CUDA.
- The basic execution model: threads, blocks, and grids.
- How to write and launch a simple CUDA kernel.
- How to manage device memory and transfer data.
- How to compile and run basic CUDA programs using nvcc.
- Minimal error checking patterns suitable for beginners.
Subsequent chapters can build on this by exploring advanced memory types, optimization strategies, and higher-level directive-based models that can complement or replace hand-written CUDA in some HPC workloads.