
Introduction to CUDA

What CUDA Is (and What It Is Not)

CUDA (Compute Unified Device Architecture) is NVIDIA’s programming model and toolchain for writing general-purpose code that runs on NVIDIA GPUs.

Key points:

  • CUDA is specific to NVIDIA hardware; CUDA code does not run on GPUs from other vendors without porting (e.g., to HIP or OpenCL).
  • It extends C/C++ (and, via CUDA Fortran, Fortran) with a small set of keywords, a kernel launch syntax, and a runtime API for managing the GPU.
  • It is not an automatic speedup: the work must be expressed as many largely independent, data-parallel operations to benefit.

In this chapter the focus is on foundational, practical ideas: the CUDA execution model, basic syntax, and the typical workflow, rather than performance tuning or advanced features.

Host vs Device

CUDA explicitly separates:

  • the host: the CPU and its main memory, where the ordinary program logic runs;
  • the device: the GPU and its own, physically separate memory, where kernels run.

A typical CUDA program:

  1. Runs on the host (ordinary C/C++/Fortran code).
  2. Allocates memory on the GPU.
  3. Copies data from host to device.
  4. Launches a kernel on the GPU.
  5. Copies results back from device to host.
  6. Frees GPU memory and exits.

Conceptually, the host orchestrates the work (allocations, data transfers, kernel launches), while the device performs the heavy, data-parallel computation. Data does not move between the two automatically; every transfer is an explicit copy.

The CUDA Execution Model: Threads, Blocks, and Grids

CUDA provides a hierarchical way to organize parallel work:

  • Thread: the smallest unit of execution; each thread runs the kernel body with its own indices.
  • Block: a group of threads that are scheduled together and can cooperate and synchronize with each other.
  • Grid: the collection of all blocks launched for a single kernel call.

You specify how many threads and blocks to use when launching a kernel. CUDA supports 1D, 2D, and 3D layouts for both threads and blocks, which helps with mapping to multidimensional data (e.g., images, matrices).

Illustration (conceptual): a kernel launch creates one grid; the grid is made up of blocks; each block is made up of threads; and every thread executes the same kernel code on its own portion of the data.

The right choice of block size and grid size affects performance, but at this level you mainly need to understand how to compute indices from these structures.
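
To make the hierarchy concrete, here is a minimal sketch (not part of the example developed below; the kernel name hello_kernel and the <<<2, 4>>> configuration are arbitrary choices for illustration) in which every thread prints its block index, its thread index, and the global index computed from them, using device-side printf:

#include <stdio.h>
#include <cuda_runtime.h>

__global__
void hello_kernel() {
    // Every thread derives a unique global index from the built-in variables.
    int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global index %d\n",
           blockIdx.x, threadIdx.x, global_idx);
}

int main() {
    hello_kernel<<<2, 4>>>();   // 2 blocks of 4 threads = 8 threads in total
    cudaDeviceSynchronize();    // wait for the kernel (and its output) to finish
    return 0;
}

The order of the printed lines is not deterministic, which is itself a useful reminder that the threads run independently.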

Basic CUDA Program Structure

A minimal pattern in C/C++:

  1. Define a kernel (runs on GPU):
    • Marked with __global__ (called from host, runs on device).
    • Each thread executes the same function body, but with different indices.
  2. Write host code (runs on CPU):
    • Allocates and initializes host data.
    • Allocates device memory with cudaMalloc.
    • Copies data to device with cudaMemcpy.
    • Launches kernel with a special <<<...>>> syntax.
    • Copies results back with cudaMemcpy.
    • Frees device memory.

Example: vector addition C[i] = A[i] + B[i]

#include <stdio.h>
#include <cuda_runtime.h>
// 1. Kernel: runs on the GPU
__global__
void vector_add(const float *A, const float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // global index
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}
int main() {
    int N = 1 << 20; // 1M elements
    size_t size = N * sizeof(float);
    // 2. Allocate and initialize host memory
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);
    for (int i = 0; i < N; ++i) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }
    // 3. Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
    // 4. Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // 5. Launch kernel: choose grid and block sizes
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vector_add<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Wait for GPU to finish (and check for errors in real code)
    cudaDeviceSynchronize();
    // 6. Copy results back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // 7. Verify a few results
    printf("C[0] = %f (expected 3.0)\n", h_C[0]);
    // 8. Free device and host memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}

Core ideas to notice:

  • The kernel is marked __global__ and is written from the perspective of a single thread; the global index idx selects which element that thread handles.
  • Host and device memory are separate: h_A/h_B/h_C live in CPU memory, d_A/d_B/d_C in GPU memory, and data moves only via explicit cudaMemcpy calls.
  • The bounds check if (idx < N) protects against the extra threads created by rounding the grid size up.
  • The launch configuration <<<blocksPerGrid, threadsPerBlock>>> determines how many threads execute the kernel.
  • cudaDeviceSynchronize() makes the host wait until the kernel has finished before the results are copied back.

Thread and Block Indexing

Within a kernel, CUDA provides built-in variables:

  • threadIdx: the index of the thread within its block;
  • blockIdx: the index of the block within the grid;
  • blockDim: the number of threads per block;
  • gridDim: the number of blocks in the grid.

For a 1D configuration, the global index of a thread is:

int idx = blockIdx.x * blockDim.x + threadIdx.x;

For 2D and 3D layouts, you extend the idea using .x, .y, .z. Example for 2D:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

You typically add a bounds check (if (idx < N)) because the grid size is rounded up, so the total number of launched threads can exceed the number of data elements. For example, with N = 1000 and 256 threads per block you launch 4 blocks, i.e. 1024 threads, and the last 24 must do nothing.

Memory Management Basics in CUDA

At this introductory level, you only need a few core functions from the CUDA Runtime API:

  • cudaMalloc: allocate memory on the device;
  • cudaMemcpy: copy data between host and device (the direction is given by cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost);
  • cudaFree: release device memory.

These host-to-device and device-to-host transfers can be slow compared to on-GPU computation, so even simple examples should avoid unnecessary copies.
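
If you want to see the cost of a transfer for yourself, CUDA events give a simple way to time it. The sketch below assumes the h_A/d_A buffers and size from the vector-addition example above and measures one host-to-device copy:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                               // mark the start
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // the copy being timed
cudaEventRecord(stop);                                // mark the end
cudaEventSynchronize(stop);                           // wait until the copy has completed

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);               // elapsed time in milliseconds
printf("Host-to-device copy of %zu bytes took %.3f ms\n", size, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);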

Kernel Launch Configuration

You control how much parallelism you expose via:

  • the number of threads per block, and
  • the number of blocks per grid,

both of which can be plain integers for 1D launches or dim3 values for 2D/3D launches, passed in the <<<blocksPerGrid, threadsPerBlock>>> launch configuration.

Example: process a 2D image of size width × height:

dim3 threadsPerBlock(16, 16);
dim3 blocksPerGrid(
    (width  + threadsPerBlock.x - 1) / threadsPerBlock.x,
    (height + threadsPerBlock.y - 1) / threadsPerBlock.y
);
image_kernel<<<blocksPerGrid, threadsPerBlock>>>(...);
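
The body of image_kernel is not defined in the snippet above; a plausible sketch, assuming a single-channel float image stored row by row and a kernel that simply scales every pixel by a factor, could look like this:

__global__
void image_kernel(float *image, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height) {                   // guard against the rounded-up grid
        image[y * width + x] *= factor;              // one thread per pixel
    }
}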

Practical guidelines for beginners:

  • Pick a block size that is a multiple of 32 (the warp size); 128 or 256 threads per block is a common, safe default.
  • Derive the grid size from the data size with the rounding-up formula (N + threadsPerBlock - 1) / threadsPerBlock.
  • Always guard the kernel body with a bounds check so the extra threads in the last block do no harm.

Fine-tuning these choices for performance belongs in more advanced material; here you mainly need to understand the relationship between data size and launch parameters.

Compiling and Running CUDA Programs

CUDA programs are compiled with NVIDIA’s compiler nvcc:

  nvcc -o vector_add vector_add.cu
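
Depending on the system, you may also want to pass an optimization level and a target GPU architecture, for example (sm_80 is only an illustration; it should match the GPU you compile for):

  nvcc -O2 -arch=sm_80 -o vector_add vector_add.cu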

Example interactive run (if a GPU is available on the current node):

./vector_add

Basic Error Handling

CUDA API calls and kernel launches can fail. At minimum you should:

  • check the return value (cudaError_t) of runtime API calls such as cudaMalloc and cudaMemcpy;
  • after a kernel launch, call cudaGetLastError() to catch launch/configuration errors, and check the return value of cudaDeviceSynchronize() to catch errors that occur while the kernel runs.

  cudaError_t err = cudaMalloc((void**)&d_A, size);
  if (err != cudaSuccess) {
      fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
      return 1;
  }
  vector_add<<<blocksPerGrid, threadsPerBlock>>>(...);
  cudaError_t errSync  = cudaDeviceSynchronize();
  cudaError_t errAsync = cudaGetLastError();
  if (errSync != cudaSuccess) {
      fprintf(stderr, "Sync error: %s\n", cudaGetErrorString(errSync));
  }
  if (errAsync != cudaSuccess) {
      fprintf(stderr, "Async kernel error: %s\n", cudaGetErrorString(errAsync));
  }

This minimal pattern makes debugging much easier, especially on clusters where you may not have direct interactive access to the GPU node outputs.
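
Because checking every call individually quickly becomes repetitive, many codes wrap the pattern in a small helper macro. One possible sketch (the name CUDA_CHECK is a common convention, not part of the CUDA API):

  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda_runtime.h>

  // Wrap a CUDA runtime call: print where it failed and abort on error.
  #define CUDA_CHECK(call)                                              \
      do {                                                              \
          cudaError_t err_ = (call);                                    \
          if (err_ != cudaSuccess) {                                    \
              fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                      __FILE__, __LINE__, cudaGetErrorString(err_));    \
              exit(EXIT_FAILURE);                                       \
          }                                                             \
      } while (0)

  // Usage:
  CUDA_CHECK(cudaMalloc((void**)&d_A, size));
  CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));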

Typical First CUDA Tasks

For practice and to build intuition, beginners often start with:

  • vector addition (as above);
  • SAXPY (y = a*x + y) and other element-wise array operations;
  • simple per-pixel image operations such as brightness scaling or grayscale conversion.

These are all embarrassingly parallel and map naturally to the CUDA execution model: one thread per element (or pixel), with trivial indexing and limited control flow.
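
To show how small such a starter kernel can be, here is a sketch of SAXPY (y = a*x + y); d_x and d_y are assumed to be device arrays prepared with the same allocate/copy pattern as in the vector-addition example:

__global__
void saxpy(float a, const float *x, float *y, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        y[idx] = a * x[idx] + y[idx];   // one element per thread
    }
}

// Launched with the usual rounded-up grid size:
// saxpy<<<(N + 255) / 256, 256>>>(2.0f, d_x, d_y, N);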

When to Use CUDA in an HPC Context

Within an HPC workflow, CUDA is particularly suitable when:

  • the problem exposes massive data parallelism (many independent, element-wise or per-cell operations);
  • there is enough computation per byte transferred to amortize the cost of host-device copies;
  • NVIDIA GPUs are actually available on the target nodes;
  • no existing GPU library already provides the computation you need.

If your work is already covered by well-optimized CUDA-based libraries (e.g., cuBLAS for dense linear algebra, cuFFT for FFTs), it is often better to call those libraries rather than writing your own kernels from scratch.
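
For instance, the SAXPY operation from the previous section is already provided by cuBLAS. A minimal sketch of calling it on device arrays d_x and d_y (allocated and filled as before, with the program linked using -lcublas) could look like this:

#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);                           // create the cuBLAS context

const float alpha = 2.0f;
// Computes d_y = alpha * d_x + d_y using the library instead of a hand-written kernel.
cublasSaxpy(handle, N, &alpha, d_x, 1, d_y, 1);

cublasDestroy(handle);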

Summary

In this introduction, you saw:

  • what CUDA is and how the host (CPU) and device (GPU) divide the work;
  • the thread/block/grid execution model and how to compute a global index;
  • the basic program structure: allocate, copy in, launch, copy back, free;
  • how to choose a launch configuration (including dim3 for 2D problems);
  • how to compile with nvcc and add minimal error handling.

Subsequent chapters can build on this by exploring advanced memory types, optimization strategies, and higher-level directive-based models that can complement or replace hand-written CUDA in some HPC workloads.
