What OpenACC Is and When to Use It
OpenACC is a high-level, directive-based programming model designed to make it easier to run existing CPU code on GPUs and other accelerators. Instead of writing explicit GPU kernels (as in CUDA), you add hints (pragmas/directives) to your C, C++, or Fortran code that tell the compiler which parts to offload and how to manage data.
Key characteristics:
- Directive-based: You annotate loops and regions using #pragma acc (C/C++) or !$acc (Fortran).
- Portable: The same code can compile and run on systems with or without GPUs (it can fall back to CPU execution).
- Incremental: You can start by accelerating only a few hotspots and gradually extend coverage.
- Target-agnostic: Although heavily used for GPUs, OpenACC is conceptually meant for a variety of accelerators.
Typical use cases:
- Large, compute-intensive loops over arrays and grids (e.g., PDE solvers, stencil codes).
- Codes that already exist in Fortran or C/C++ and should be ported to GPUs with minimal restructuring.
- Situations where you want performance improvements without fully rewriting the code in a GPU-specific language.
Basic Programming Model and Directives
OpenACC introduces three closely related ideas:
- Compute regions: Where the computation runs (e.g., on a GPU).
- Parallelism: How the computation is decomposed into parallel work.
- Data regions: How data is moved between host (CPU) and device (GPU).
In C/C++, directives take the form:
#pragma acc <keyword> [clauses...]

In Fortran:

!$acc <keyword> [clauses...]

The most important high-level directives to know initially are:
- parallel / kernels: Define regions that will run on the accelerator.
- loop: Apply parallelization directives to loops inside a region.
- data: Control lifetime and movement of arrays and variables.
- update: Synchronize data between host and device during a region.
- enter data / exit data: Manage persistent data outside explicit regions.
At an introductory level, you typically start with parallel loop and a few data clauses, then refine from there.
First Example: Offloading a Simple Loop
Consider a plain C loop that performs a SAXPY operation ($y = a x + y$):
void saxpy(int n, float a, float *x, float *y)
{
for (int i = 0; i < n; i++) {
y[i] = a * x[i] + y[i];
}
}

A first OpenACC version to offload this to the GPU might look like:
void saxpy(int n, float a, float *x, float *y)
{
#pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
for (int i = 0; i < n; i++) {
y[i] = a * x[i] + y[i];
}
}

What this says conceptually:
- parallel loop: Run this loop on the accelerator, in parallel.
- copyin(x[0:n]): Allocate space for x on the device and copy data from host to device before the region.
- copy(y[0:n]): Allocate space for y on the device, copy from host to device before the region, and copy back to host after the region.
The goal is to express what is parallel and what data is needed, not how to launch GPU threads in detail.
Compute Regions: `parallel` vs `kernels`
OpenACC offers two primary ways to define code sections that will run on the accelerator:
parallel regions:
- You, the programmer, state that a region is parallel and should be executed by many threads.
- You control which loops are parallelized.
- Often gives more predictable performance if you understand the parallelism.
kernels regions:
- The compiler analyzes the code inside and automatically decides which loops can be parallelized and how.
- Easier for first experiments, but may lead to more conservative parallelization.
Minimal examples (C):
#pragma acc parallel
{
#pragma acc loop
for (int i = 0; i < n; i++) {
a[i] = b[i] + c[i];
}
}

or more compact:
#pragma acc parallel loop
for (int i = 0; i < n; i++) {
a[i] = b[i] + c[i];
}
Using kernels:
#pragma acc kernels
for (int i = 0; i < n; i++) {
a[i] = b[i] + c[i];
}
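One practical difference worth knowing: a kernels region may contain several loops, and the compiler typically generates one device kernel per parallelizable loop. A minimal sketch, reusing the arrays from the example above (the rescaling in the second loop is purely illustrative):

#pragma acc kernels
{
    // the compiler analyzes each loop separately and may generate
    // a separate device kernel for each parallelizable loop
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
    for (int i = 0; i < n; i++) {
        c[i] = 2.0f * a[i];   // depends on the first loop; the compiler preserves the order
    }
}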
For a beginner, it’s common to start with parallel loop on clearly independent loops.
Loop Parallelism and the OpenACC Execution Model (High-Level)
On GPUs, work is executed by many lightweight threads organized in hierarchical groups. OpenACC abstracts this into three main levels of parallelism:
- gang: Coarse-grained groups of work (loosely similar to thread blocks).
- worker: Intermediate level of parallelism (typically used less explicitly by beginners).
- vector: Fine-grained parallelism for SIMD-style execution (similar to threads within a block, or vector lanes).
In introductory usage, you often just write:
#pragma acc parallel loop
for (int i = 0; i < n; i++) {
...
}

and let the compiler choose how to map your loop iterations to gangs/workers/vectors.
Later, when you need more control, you can specify clauses:
#pragma acc parallel loop gang vector
for (int i = 0; i < n; i++) {
...
}

or even:
#pragma acc parallel
{
#pragma acc loop gang
for (int i = 0; i < Nx; i++) {
#pragma acc loop vector
for (int j = 0; j < Ny; j++) {
...
}
}
}

But initially, the conceptual takeaway is: annotate loops; the compiler maps them onto GPU parallelism for you.
Data Management: Basic Concepts and Clauses
To compute on the accelerator, data must be available on the device. OpenACC gives you explicit control over:
- What is allocated on the device.
- When data is copied between host and device.
- How long device copies of data persist.
At a basic level, you manage this using:
- Data clauses on compute directives: copy, copyin, copyout, create, present, deviceptr.
- Data regions: #pragma acc data applied to the following structured block in C/C++, or !$acc data ... !$acc end data in Fortran.
The most common basic clauses:
- copy(array[start:length]): Copy data from host to device at region entry, and back from device to host at region exit.
- copyin(...): Copy from host to device only.
- copyout(...): Copy from device to host only.
- create(...): Allocate on device only, without copying any initial data.
Example with a data region:
// allocate and initialize a, b, c on host
#pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
{
#pragma acc parallel loop
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
// you can have multiple parallel loops using a, b, c here
} // data is synchronized at the end of the data region
Using a data region reduces repeated transfers: a, b, and c stay on the device for the entire region, even if there are multiple kernels inside.
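For example, the comment in the region above could be made concrete with a second, purely illustrative kernel that rescales c; both kernels reuse the device copies, and c is transferred back only once:

#pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];          // first kernel
    }

    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        c[i] = 2.0f * c[i];          // second kernel reuses the device copy of c
    }
} // only here is c copied back to the host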
Typical Directive Patterns for Scientific Loops
Below are common patterns you might use in numerical kernels.
Simple 1D loop
#pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
for (int i = 0; i < n; i++) {
y[i] = a * x[i] + y[i];
}

Nested loops (2D stencil-style)
#pragma acc data copyin(uold[0:Nx*Ny]) copy(unew[0:Nx*Ny])
{
    /* copy (not copyout) for unew so that the boundary values,
       which the loop never writes, are preserved on the host */
    #pragma acc parallel loop collapse(2)
    for (int i = 1; i < Nx-1; i++) {
        for (int j = 1; j < Ny-1; j++) {
            int idx = i*Ny + j;
            /* 4-point average of the neighbouring values */
            unew[idx] = 0.25f * (uold[idx-1] + uold[idx+1]
                               + uold[idx-Ny] + uold[idx+Ny]);
        }
    }
}

Here:
- collapse(2) tells the compiler to treat the nested loops as a single larger iteration space, improving parallelism for large 2D problems.
Reductions
Many numerical codes have reductions (e.g., sums, max norms). OpenACC supports reduction clauses:
float sum = 0.0f;
#pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
for (int i = 0; i < n; i++) {
sum += a[i];
}
The compiler generates a parallel reduction on the device and gives you the final value back in sum when the region finishes.
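Reductions also combine with collapsed loop nests. A short sketch of a maximum-difference check for an iterative 2D solver, reusing the arrays from the stencil example (fabsf comes from <math.h>):

float err = 0.0f;
#pragma acc parallel loop collapse(2) reduction(max:err) copyin(unew[0:Nx*Ny], uold[0:Nx*Ny])
for (int i = 0; i < Nx; i++) {
    for (int j = 0; j < Ny; j++) {
        float diff = fabsf(unew[i*Ny + j] - uold[i*Ny + j]);  // fabsf from <math.h>
        if (diff > err) err = diff;   // combined into a max reduction over all iterations
    }
}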
Controlling Data Lifetime with `enter data` and `exit data`
For long-running applications or iterative solvers, you often want arrays to live on the device for a long time (e.g., many time steps), instead of copying them for every kernel.
In that case, you can use enter data and exit data directives outside your computational loops:
// allocate and initialize arrays on host
float *u = ...;
float *v = ...;
// create persistent copies on device
#pragma acc enter data copyin(u[0:n]) create(v[0:n])
for (int t = 0; t < nsteps; t++) {
#pragma acc parallel loop present(u[0:n], v[0:n])
for (int i = 0; i < n; i++) {
v[i] = f(u[i]);
}
    // optionally swap the host pointers; both arrays are already present
    // on the device, so no data transfer is needed for the swap itself
    float *tmp = u; u = v; v = tmp;
}
// get final result back and free device memory
#pragma acc exit data copyout(u[0:n]) delete(v[0:n])

Key ideas:
- enter data and exit data decouple data management from individual kernels.
- present asserts that the data is already on the device (and avoids extra copies).
- update lets you synchronize data in the middle of a larger computation, in either direction: update device when the host has modified data the device needs, and update self (or host) when the host needs to see data computed on the device; see the sketch below.
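A sketch of that last point, using a device-to-host update for periodic output while the array stays resident on the device (out_interval, write_output, and g are illustrative names; u is assumed to have been placed on the device with enter data as above):

for (int t = 0; t < nsteps; t++) {
    #pragma acc parallel loop present(u[0:n])
    for (int i = 0; i < n; i++) {
        u[i] = g(u[i]);
    }

    if (t % out_interval == 0) {
        #pragma acc update self(u[0:n])   // copy the current u from device to host
        write_output(u, n, t);            // host-side I/O now sees up-to-date data
    }
}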
Basic Asynchronous Execution
GPU kernels often run asynchronously with respect to the CPU. OpenACC allows you to express this using the async and wait clauses.
Example:
#pragma acc data copyin(a[0:n]) copyout(b[0:n])
{
#pragma acc parallel loop async(1)
for (int i = 0; i < n; i++) {
b[i] = f(a[i]);
}
// CPU can do other work here
#pragma acc wait(1)
// at this point, the GPU work associated with async(1) is complete
}

High-level points:
- async(id) launches work on a given asynchronous queue (identified by id) and lets the host continue immediately.
- wait(id) blocks until the work on that queue finishes; wait with no argument waits for all queues.
- As asynchronous patterns get more complex (overlapping computation and data transfers), OpenACC lets you express these concisely with async/wait rather than low-level stream management; a sketch with two queues follows below.
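A minimal sketch of two independent kernels placed on separate queues (d is an illustrative output array; whether the kernels actually overlap depends on the device and runtime):

#pragma acc data copyin(a[0:n], c[0:n]) copyout(b[0:n], d[0:n])
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; i++) {
        b[i] = f(a[i]);
    }

    #pragma acc parallel loop async(2)
    for (int i = 0; i < n; i++) {
        d[i] = f(c[i]);
    }

    #pragma acc wait   // no argument: wait for all outstanding queues
}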
Compiling and Running OpenACC Programs (Conceptual Overview)
Exact compiler flags and tools depend on your system and toolchain, but the general pattern is:
- Use a compiler that supports OpenACC (e.g., nvc, nvfortran, some vendor or Clang/LLVM-based compilers with OpenACC support).
- Enable OpenACC with a specific flag (e.g., -acc).
- Optionally specify the target GPU architecture and optimization levels.
Typical compile command (illustrative):
nvc -acc -O3 -Minfo=accel -o saxpy saxpy.c

Where:
- -acc enables OpenACC support.
- -Minfo=accel asks the compiler to report what it parallelized and how.
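As a compiler-specific illustration, the NVIDIA HPC compilers also let you choose the offload target, which is one way to exercise the CPU fallback mentioned earlier; the exact flag spellings vary between compilers and versions, so check your compiler's documentation:

nvc -acc=gpu       -O3 -Minfo=accel -o saxpy_gpu saxpy.c   # offload the parallel regions to a GPU
nvc -acc=multicore -O3 -Minfo=accel -o saxpy_cpu saxpy.c   # run them in parallel on the host CPU cores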
On an HPC cluster, you typically:
- Load the appropriate compiler module (via environment modules).
- Compile with OpenACC flags.
- Submit a batch job that requests GPU resources, and run the generated executable on nodes with GPUs.
The details of job submission, module loading, and resource requests are covered in other chapters; here, the key idea is that OpenACC code is compiled into binaries that target the GPU, then run like any other HPC application on GPU-equipped nodes.
Common Pitfalls for Beginners
When first using OpenACC, common issues include:
- Forgetting data movement:
  - Writing #pragma acc parallel loop without data clauses or a data region, and then wondering why results are incorrect or slow.
  - Always think: "Where is my data now? Do I need to move it, or is it already present?"
- Parallelizing dependent loops:
  - Marking loops as parallel when loop iterations are not independent (e.g., recurrences, dependencies across i).
  - With parallel loop, you assert that the iterations are safe to run in parallel; the compiler does not verify this, so dependent loops produce wrong results (see the sketch after this list).
- Too small problem sizes:
  - Running very small loops on the GPU and concluding that OpenACC is "slow".
  - GPUs need enough work to amortize kernel launch overhead and data transfers.
- Overusing kernels without understanding:
  - Relying on kernels regions and expecting the compiler to find all parallelism automatically.
  - Sometimes parallel loop with explicit directives on the right loops yields better performance and clarity.
- Unnecessary data transfers:
  - Repeatedly copying the same arrays in and out within tight loops, instead of using a persistent data region or enter data / exit data.
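As a concrete instance of the dependent-loop pitfall, a prefix-sum style recurrence must not be annotated with parallel loop, because each iteration reads the result of the previous one:

// NOT safe for "#pragma acc parallel loop": iteration i depends on iteration i-1
for (int i = 1; i < n; i++) {
    y[i] = y[i-1] + x[i];   // a recurrence; asserting it is parallel gives wrong results
}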
Strategy for Incrementally Porting Code with OpenACC
A practical way to start using OpenACC on an existing CPU code:
- Profile on CPU to find the most time-consuming loops or regions.
- Add minimal directives to one hotspot:
  - A simple parallel loop with copy/copyin/copyout clauses on arrays.
- Verify correctness:
  - Check that results match CPU-only runs (within acceptable floating-point differences).
- Measure performance:
  - Ensure that you are gaining speedup compared to CPU runs (with data transfer overhead accounted for).
- Refine data movement:
  - Introduce data regions or enter data / exit data to reduce transfers.
- Iterate:
  - Move on to the next hotspot, keeping the structure readable and maintainable.
Over time, you can introduce more advanced features (asynchronous execution, finer control over gang/vector mapping, etc.) as needed, but the initial steps focus on clear, independent loops and explicit data clauses.
How OpenACC Compares to Other GPU Programming Models (Conceptual)
OpenACC sits on a spectrum of GPU programming approaches:
- Lower-level (e.g., CUDA, HIP, OpenCL):
- Fine-grained control over threads, warps, memory hierarchy.
- More code changes and complexity.
- Often higher peak performance with expert tuning.
- Directive-based (OpenACC, OpenMP target offload):
- Less invasive changes to existing codes.
- Higher-level description of parallelism and data.
- Easier incremental adoption.
OpenACC’s niche:
- Especially popular in Fortran and C scientific codes.
- Attractive when porting large, legacy applications.
- Often used on systems where vendor or site compilers have good OpenACC support.
Knowing this context helps you understand when OpenACC is an appropriate tool: when you value portability, incremental porting, and maintainability, and do not need to micromanage every aspect of GPU execution from the start.