Introduction to OpenACC

What OpenACC Is and Where It Fits

OpenACC is a programming model that lets you offload parts of your code to GPUs and other accelerators using compiler directives, written as pragmas in C and C++ and as specially marked comments in Fortran, rather than rewriting everything in a low level GPU language. Its main goal is to make GPU programming incremental and portable, so you can start from existing serial or OpenMP style code and mark the computationally heavy regions for acceleration.

Unlike CUDA, which is specific to NVIDIA GPUs and requires explicit kernel code, OpenACC is vendor neutral and designed to work across different accelerators when supported by the compiler. You describe what should run on the device and what data must move, and the compiler generates the low level kernel code and the data transfers.

OpenACC is especially attractive if you already have a large CPU code and want to experiment with GPU acceleration without a full rewrite, or if you aim for a single source that can still compile and run on systems without GPUs.

Basic Programming Model

OpenACC follows a directive based model. In C and C++ you write #pragma acc before loops or blocks of code. In Fortran you use comment-like lines such as !$acc. If the compiler understands OpenACC, it uses these directives to generate GPU code. If not, the directives are ignored and the program still compiles and runs on the CPU.

The OpenACC model revolves around three basic ideas. First, there is a host, usually the CPU, and one or more devices, such as GPUs. Second, you mark regions of your code that should execute on the device. Third, you specify what data must be present on the device for those regions.

A minimal OpenACC workflow is: identify a hot loop, add a parallel or kernels directive, and ensure the required arrays are copied to the device. From there you can refine parallelism and data movement.

A First OpenACC Example

Consider a simple vector addition in C. The serial version has a loop that computes c[i] = a[i] + b[i]. With OpenACC, you can offload this loop to a GPU by adding a directive that describes the parallelism and data usage.

#include <stdio.h>
#include <stdlib.h>
int main() {
    int n = 1000000;
    float *a = malloc(n * sizeof(float));
    float *b = malloc(n * sizeof(float));
    float *c = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = 2.0f * i;
    }
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
    printf("c[10] = %f\n", c[10]);
    free(a);
    free(b);
    free(c);
    return 0;
}

The parallel loop directive tells the compiler to create a parallel region on the device and to parallelize the loop. The copyin and copyout clauses specify that a and b are input arrays that must be copied from host to device, and c is an output array that must be copied from device to host when the region ends.

In Fortran the pattern is similar, with the syntax adapted to Fortran arrays and directive comments. The fundamental idea is the same: you mark the loops and the data, and the compiler performs the offload.

Core Concepts: Parallel Regions and Loops

OpenACC provides constructs that describe how computation is mapped to the device. For beginners, the most important are parallel, kernels, and loop.

A parallel region describes a block of code that runs as a single accelerator kernel, where you explicitly state which loops are parallel. You typically write #pragma acc parallel loop in C or !$acc parallel loop in Fortran before a loop nest, and you help the compiler understand which loops are independent and can be distributed across the GPU threads.

A kernels region is more descriptive, leaving more decisions to the compiler. You enclose one or more loops in #pragma acc kernels { ... } or the Fortran equivalent, and the compiler analyzes each loop and generates one or more kernels. kernels can be easier to start with, since it requires less manual mapping, but parallel usually gives more explicit control and often better performance once you understand your code.
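
As a rough sketch, reusing the arrays a, b, and c from the earlier example and adding an illustrative output array d, a kernels region might look like this:

#pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n], d[0:n])
{
    /* The compiler analyzes each loop and may turn each one into its own kernel. */
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
    for (int i = 0; i < n; i++) {
        d[i] = a[i] - b[i];
    }
}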

The loop directive is used inside a parallel region, or combined with it, to mark specific loops. For nested loops you can use features like collapse to tell the compiler to treat multiple loops as a single big iteration space. For example, #pragma acc parallel loop collapse(2) indicates that a 2D loop nest should be flattened for parallel execution.
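
As a sketch, assuming a matrix stored in a flat array x of size n * m with an output array y of the same size (the names are illustrative), collapse(2) merges the two loops into a single n * m iteration space:

#pragma acc parallel loop collapse(2) copyin(x[0:n*m]) copyout(y[0:n*m])
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        /* Both loop levels are flattened and distributed across the device. */
        y[i * m + j] = 2.0f * x[i * m + j];
    }
}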

In OpenACC, loops you mark as parallel must be free of loop carried dependencies. If one iteration depends on results from another, parallelizing the loop will produce incorrect results. You must ensure that loop iterations are independent, or use reductions and similar clauses when the dependency has a structured form such as an accumulation.
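
For instance, a running sum like the following (s is an illustrative output array) has a loop carried dependency and must not be marked parallel loop as written:

/* Incorrect to parallelize: iteration i reads s[i - 1] written by iteration i - 1. */
for (int i = 1; i < n; i++) {
    s[i] = s[i - 1] + a[i];
}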

Data Management in OpenACC

GPU performance is heavily influenced by data movement between CPU and GPU. OpenACC gives you data clauses and data regions to control when arrays are transferred.

The simplest way, shown earlier, is to attach data clauses directly to a parallel or kernels directive. Clauses like copy, copyin, copyout, create, and present describe the role of variables. copy(a[0:n]) means allocate storage for a on the device, copy it from host to device at region entry, and copy any updates back to the host at region exit. copyin means input only. copyout means output only. create means allocate on device without copying initial values. present means that you assert the data is already on the device, for example from an enclosing data region.

Often it is more efficient to create explicit data regions around a sequence of kernels. Instead of copying arrays to the device for each loop, you use a data region to keep the arrays resident on the device across multiple compute regions.

For example:

#pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        c[i] = c[i] * 2.0f;
    }
}

Here a and b are copied to the device once at the start of the data region, c is allocated and copied back at the end, and both loops reuse the data without additional transfers.

You can also manage data more explicitly with enter data and exit data directives, which operate outside of compute regions and let you allocate device data early and free it late, for example at the beginning and end of a time stepping simulation.
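
A sketch of that pattern, assuming an illustrative step(u, n) routine whose internal compute regions use present(u[0:n]), might look like this:

#pragma acc enter data copyin(u[0:n])
for (int t = 0; t < nsteps; t++) {
    /* u stays resident on the device across all time steps. */
    step(u, n);
}
#pragma acc exit data copyout(u[0:n])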

A key rule for performance is to minimize host device data transfers. Use persistent data regions or enter data and exit data to keep data on the device across multiple kernels, instead of copying in and out for every loop.

Mapping to GPU Hardware: Gangs, Workers, and Vectors

OpenACC abstracts the GPU execution hierarchy with three levels of parallelism called gangs, workers, and vectors. Different accelerators map these levels differently to hardware.

On many NVIDIA GPUs, gangs roughly correspond to thread blocks, workers to warps within a block, and vectors to the individual threads within a warp. This mapping is not exact and is implementation dependent, but it gives a mental model: gangs represent coarse parallelism across the device, workers represent intermediate groups, and vectors represent the innermost fine grained parallelism.

You can help the compiler by annotating loops with clauses like gang, worker, and vector. For example:

#pragma acc parallel
#pragma acc loop gang
for (int i = 0; i < n; i++) {
    #pragma acc loop vector
    for (int j = 0; j < m; j++) {
        // compute with a[i][j]
    }
}

In many simple cases you do not need to specify these explicitly, and the compiler chooses a mapping. However, understanding this hierarchy becomes important for tuning performance, especially for nested loops and memory access patterns.

Reductions and Other Common Patterns

Many numerical kernels involve sums, minima, maxima, or other global operations over arrays. These are reductions, and OpenACC provides reduction clauses that instruct the compiler to generate the correct parallel reduction pattern.

A basic example:

double sum = 0.0;
#pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
for (int i = 0; i < n; i++) {
    sum += a[i];
}

The reduction(+:sum) clause says that each parallel worker accumulates into its own private copy of sum, and these partial sums are combined at the end. For floating point data the order of additions may differ from the serial loop, so the result can differ by rounding, but there are no race conditions.

Other reduction operators are available, such as * for products, max for maxima, and min for minima. You should use reduction clauses whenever a loop involves an accumulation across iterations, since naive use of a shared variable would create race conditions.
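
For example, a maximum over the array a from earlier can be written with a max reduction (a minimal sketch):

float maxval = a[0];
#pragma acc parallel loop reduction(max:maxval) copyin(a[0:n])
for (int i = 0; i < n; i++) {
    if (a[i] > maxval) maxval = a[i];
}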

OpenACC also lets you control which variables are shared and which are private within a parallel region. For variables that must be private to each iteration, such as temporary scalars, you can use private(var). This helps avoid unintended sharing between parallel iterations.
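
A minimal sketch, assuming a temporary scalar tmp that would otherwise be shared by all iterations:

float tmp;
#pragma acc parallel loop private(tmp) copyin(a[0:n], b[0:n]) copyout(c[0:n])
for (int i = 0; i < n; i++) {
    /* Each iteration works on its own copy of tmp. */
    tmp = a[i] * b[i];
    c[i] = tmp + 1.0f;
}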

When using OpenACC on loops that update a scalar across iterations, always consider whether you need a reduction clause or a private variable. Avoid writing to shared scalars from multiple iterations without an appropriate clause, because this introduces race conditions and undefined results.

Asynchronous Execution

OpenACC includes features for asynchronous execution so that you can overlap data transfers and computation or run multiple kernels concurrently. This is controlled with the async clause and the wait directive.

If you add async to a parallel, kernels, or data construct, the corresponding operation can execute asynchronously with respect to the host. You can also specify a numeric queue identifier, for example async(1), to group operations that should share the same stream or queue on the device.

A sketch of one such pattern, assuming the work can be split into equal chunks (the names nchunks and chunk_size are illustrative, with n = nchunks * chunk_size), alternates between two async queues so that the transfers for one chunk can overlap with the computation of another:
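
#pragma acc enter data create(a[0:n], c[0:n])
for (int chunk = 0; chunk < nchunks; chunk++) {
    int start = chunk * chunk_size;
    int q = chunk % 2 + 1;   /* alternate between queues 1 and 2 */
    /* Stage this chunk of a to the device, compute, and copy the result back,
       all on the same queue so the three operations stay in order. */
    #pragma acc update device(a[start:chunk_size]) async(q)
    #pragma acc parallel loop async(q)
    for (int i = start; i < start + chunk_size; i++) {
        c[i] = 2.0f * a[i];
    }
    #pragma acc update self(c[start:chunk_size]) async(q)
}
#pragma acc wait
#pragma acc exit data delete(a[0:n], c[0:n])

The final wait directive blocks the host until both queues have finished, after which the results in c are valid on the host.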
