Introduction to OpenACC

What OpenACC Is and When to Use It

OpenACC is a high-level, directive-based programming model designed to make it easier to run existing CPU code on GPUs and other accelerators. Instead of writing explicit GPU kernels (as in CUDA), you add hints (pragmas/directives) to your C, C++, or Fortran code that tell the compiler which parts to offload and how to manage data.

Key characteristics:

  • Directive-based: you annotate existing C, C++, or Fortran code instead of rewriting it in a new language.
  • Incremental: you can offload one loop or routine at a time while the rest of the program keeps running on the CPU.
  • Portable: the same annotated source can be built for different accelerators, or compiled with the directives ignored to produce a plain CPU executable.
  • Compiler-driven: the compiler handles the low-level details (kernel generation, thread mapping) based on your hints.

Typical use cases:

  • Loop-heavy numerical codes (stencils, linear algebra, particle updates) with large, regular, data-parallel loops.
  • Existing CPU applications that need GPU acceleration without a full rewrite in CUDA or another low-level model.
  • Quick prototyping to judge whether a code benefits from GPU offloading before investing in hand-tuned kernels.

Basic Programming Model and Directives

OpenACC introduces three closely related ideas:

  • Compute regions: blocks of code (marked with parallel or kernels) that are executed on the accelerator.
  • Loop parallelism: loop directives that tell the compiler which loops inside a compute region can run in parallel, and optionally how.
  • Data management: clauses and directives (copyin, copyout, data regions, and so on) that control which arrays move between host and device, and when.

In C/C++, directives take the form:

c
#pragma acc <keyword> [clauses...]

In Fortran:

fortran
!$acc <keyword> [clauses...]

The most important high-level directives to know initially are:

  • parallel and parallel loop – mark a region or loop to be executed in parallel on the accelerator.
  • kernels – mark a region in which the compiler identifies and parallelizes loops on its own.
  • loop – describe how an individual loop should be parallelized inside a compute region.
  • data, enter data, exit data – control when arrays are copied to and from the device and how long they stay there.

At an introductory level, you typically start with parallel loop and a few data clauses, then refine from there.

First Example: Offloading a Simple Loop

Consider a plain C loop that performs a SAXPY operation ($y = a x + y$):

c
void saxpy(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

A first OpenACC version to offload this to the GPU might look like:

c
void saxpy(int n, float a, float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

What this says conceptually:

  • parallel loop: run the following loop on the accelerator, with iterations distributed across its parallel execution units.
  • copyin(x[0:n]): copy the n elements of x to the device before the loop; since x is only read, it is not copied back.
  • copy(y[0:n]): copy y to the device before the loop and copy the modified values back to the host afterwards.

The goal is to express what is parallel and what data is needed, not how to launch GPU threads in detail.
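
For context, a complete but minimal driver might look like the sketch below; the array size and initial values are purely illustrative, and the main point is that the caller of saxpy does not change when the directive is added.

c
#include <stdio.h>
#include <stdlib.h>

void saxpy(int n, float a, float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

int main(void)
{
    int n = 1 << 20;                      /* illustrative problem size */
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));

    for (int i = 0; i < n; i++) {         /* initialize on the host as usual */
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    saxpy(n, 2.0f, x, y);                 /* offloaded loop; call site unchanged */

    printf("y[0] = %f (expected 4.0)\n", y[0]);
    free(x);
    free(y);
    return 0;
}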

Compute Regions: `parallel` vs `kernels`

OpenACC offers two primary ways to define code sections that will run on the accelerator:

  • parallel: you assert that the enclosed code is safe to run in parallel; the compiler trusts you and generates parallel code for the loops you mark.
  • kernels: you hand the compiler a region of code and let it analyze the loops itself, parallelizing those it can prove are independent.

Minimal examples (C):

c
#pragma acc parallel
{
    #pragma acc loop
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

or, more compactly:

c
#pragma acc parallel loop
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
}

Using kernels:

c
#pragma acc kernels
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
}

For a beginner, it’s common to start with parallel loop on clearly independent loops.
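
One practical difference is that a single kernels region can span several loops, each of which the compiler analyzes separately. A minimal sketch (the function and array names here are illustrative):

c
void scale_and_add(int n, const float *b, float *a, float *c)
{
    #pragma acc kernels copyin(b[0:n]) copyout(a[0:n], c[0:n])
    {
        /* the compiler checks each loop independently and may generate
           a separate device kernel for each one */
        for (int i = 0; i < n; i++) {
            a[i] = 2.0f * b[i];
        }
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }
}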

Loop Parallelism and the OpenACC Execution Model (High-Level)

On GPUs, work is executed by many lightweight threads organized in hierarchical groups. OpenACC abstracts this into three main levels of parallelism:

  • gang: coarse-grained parallelism; gangs execute independently of one another (roughly comparable to CUDA thread blocks).
  • worker: an intermediate level that can subdivide the work of a gang on some targets.
  • vector: fine-grained, SIMD-style parallelism within a worker (roughly comparable to threads within a CUDA block or warp).

In introductory usage, you often just write:

c
#pragma acc parallel loop
for (int i = 0; i < n; i++) {
    ...
}

and let the compiler choose how to map your loop iterations to gangs/workers/vectors.

Later, when you need more control, you can specify clauses:

c
#pragma acc parallel loop gang vector
for (int i = 0; i < n; i++) {
    ...
}

or even:

c
#pragma acc parallel
{
    #pragma acc loop gang
    for (int i = 0; i < Nx; i++) {
        #pragma acc loop vector
        for (int j = 0; j < Ny; j++) {
            ...
        }
    }
}

But initially, the conceptual takeaway is: annotate loops; the compiler maps them onto GPU parallelism for you.

Data Management: Basic Concepts and Clauses

To compute on the accelerator, data must be available on the device. OpenACC gives you explicit control over:

  • when memory is allocated on the device,
  • when data is copied from host to device and back, and
  • how long data stays resident on the device.

At a basic level, you manage this using:

  • data clauses attached directly to compute directives (as in the saxpy example), and
  • structured data regions (#pragma acc data) that keep data on the device across several compute regions.

The most common basic clauses:

  • copyin(list) – copy the data to the device at the start of the region; do not copy it back.
  • copyout(list) – allocate the data on the device and copy it back to the host at the end of the region.
  • copy(list) – copy to the device at the start and back to the host at the end.
  • create(list) – allocate device memory only, with no transfers (useful for temporaries).
  • present(list) – assert that the data is already on the device (for example, from an enclosing data region).

Example with a data region:

c
// allocate and initialize a, b, c on host
#pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
    // you can have multiple parallel loops using a, b, c here
} // data is synchronized at the end of the data region

Using a data region reduces repeated transfers: a, b, and c stay on the device for the entire region, even if there are multiple kernels inside.
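
As a sketch of that idea (the function name is illustrative), a second loop inside the same data region reuses the device copies of a, b, and c without any extra transfers:

c
void two_steps(int n, const float *a, const float *b, float *c)
{
    #pragma acc data copyin(a[0:n], b[0:n]) copy(c[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }

        /* a, b, and c are still resident on the device, so this second
           kernel launches without any additional host-device transfers */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++) {
            c[i] = c[i] * a[i] + b[i];
        }
    }
}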

Typical Directive Patterns for Scientific Loops

Below are common patterns you might use in numerical kernels.

Simple 1D loop

c
#pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

Nested loops (2D stencil-style)

c
// uold holds the current values; unew receives the updated interior
#pragma acc data copyin(uold[0:Nx*Ny]) copy(unew[0:Nx*Ny])
{
    #pragma acc parallel loop collapse(2)
    for (int i = 1; i < Nx-1; i++) {
        for (int j = 1; j < Ny-1; j++) {
            int idx = i*Ny + j;
            unew[idx] = 0.25f * (uold[idx-1] + uold[idx+1]
                               + uold[idx-Ny] + uold[idx+Ny]);
        }
    }
} // copy (rather than copyout) preserves the boundary values set on the host

Here:

  • the data region copies uold to the device once and brings unew back once, even if more loops are added inside the region;
  • collapse(2) merges the two nested loops into a single iteration space, giving the compiler more parallelism to distribute;
  • the loop bounds skip the boundary points, whose values are left as set by the host code.

Reductions

Many numerical codes have reductions (e.g., sums, max norms). OpenACC supports reduction clauses:

c
float sum = 0.0f;
#pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
for (int i = 0; i < n; i++) {
    sum += a[i];
}

The compiler generates a parallel reduction on the device and gives you the final value back in sum when the region finishes.
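
Other reduction operators follow the same pattern. For example, a maximum-absolute-value reduction might look like this sketch (the function name is illustrative; fabsf comes from <math.h>):

c
#include <math.h>

float max_abs(int n, const float *a)
{
    float m = 0.0f;
    /* reduction(max:m) combines the per-iteration maxima on the device
       and returns the final value in m */
    #pragma acc parallel loop reduction(max:m) copyin(a[0:n])
    for (int i = 0; i < n; i++) {
        float v = fabsf(a[i]);
        if (v > m) m = v;
    }
    return m;
}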

Controlling Data Lifetime with `enter data` and `exit data`

For long-running applications or iterative solvers, you often want arrays to live on the device for a long time (e.g., many time steps), instead of copying them for every kernel.

In that case, you can use enter data and exit data directives outside your computational loops:

c
// allocate and initialize arrays on host
float *u = ...;
float *v = ...;
// f must be callable on the device, e.g. declared with
// #pragma acc routine seq
// create persistent copies on device
#pragma acc enter data copyin(u[0:n]) create(v[0:n])
for (int t = 0; t < nsteps; t++) {
    #pragma acc parallel loop present(u[0:n], v[0:n])
    for (int i = 0; i < n; i++) {
        v[i] = f(u[i]);
    }
    // swap pointers on the host; the OpenACC runtime tracks the mapping
    // from host addresses to device memory, so no transfer is needed here
    float *tmp = u; u = v; v = tmp;
}
// get final result back and free device memory
#pragma acc exit data copyout(u[0:n]) delete(v[0:n])

Key ideas:

  • enter data copyin(...)/create(...) places arrays on the device once, before the time loop, instead of once per kernel.
  • present(...) in the compute regions asserts that the arrays are already on the device, so no transfers happen inside the loop.
  • exit data copyout(...)/delete(...) brings the final result back and frees device memory when you are done.
  • If the host needs intermediate values (e.g., for output), copy them explicitly with the update directive, as sketched below.
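
As a sketch of that last point (building on the arrays above; nout and write_checkpoint are hypothetical names), the update directive refreshes the host copy of u inside the time loop whenever output is needed:

c
for (int t = 0; t < nsteps; t++) {
    #pragma acc parallel loop present(u[0:n])
    for (int i = 0; i < n; i++) {
        u[i] = 0.5f * u[i];            /* placeholder device update */
    }

    if (t % nout == 0) {
        /* copy the current device values back into the host copy of u */
        #pragma acc update self(u[0:n])
        write_checkpoint(t, n, u);     /* hypothetical host-side I/O routine */
    }
}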

Basic Asynchronous Execution

GPU kernels often run asynchronously with respect to the CPU. OpenACC allows you to express this using the async and wait clauses.

Example:

c
#pragma acc data copyin(a[0:n]) copyout(b[0:n])
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; i++) {
        b[i] = f(a[i]);
    }
    // CPU can do other work here
    #pragma acc wait(1)
    // at this point, the GPU work associated with async(1) is complete
}

High-level points:

  • async(n) attaches the compute region (or data transfer) to an asynchronous queue identified by the integer n, and control returns to the host immediately.
  • wait(n) blocks the host until all work on that queue has finished; a plain wait waits for all queues.
  • Work placed on different queues may overlap, which can be used to overlap independent kernels, or computation with data transfers, as sketched below.
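
A minimal sketch with two independent loops on separate queues (the function and array names are illustrative; whether the loops actually overlap depends on the device and runtime):

c
void two_queues(int n, const float *a, float *b, float *c)
{
    #pragma acc data copyin(a[0:n]) copyout(b[0:n], c[0:n])
    {
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; i++) {
            b[i] = 2.0f * a[i];
        }

        #pragma acc parallel loop async(2)
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + 1.0f;
        }

        /* wait for both queues before the data region copies b and c back */
        #pragma acc wait
    }
}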

Compiling and Running OpenACC Programs (Conceptual Overview)

Exact compiler flags and tools depend on your system and toolchain, but the general pattern is:

  • use a compiler that supports OpenACC (for example, nvc/nvfortran from the NVIDIA HPC SDK, or GCC),
  • enable OpenACC with the appropriate flag, and
  • optionally turn on compiler feedback so you can see which loops were offloaded and what data movement was generated.

Typical compile command (illustrative):

bash
nvc -acc -O3 -Minfo=accel -o saxpy saxpy.c

Where:

  • -acc enables processing of the OpenACC directives,
  • -O3 enables the usual host-side optimizations, and
  • -Minfo=accel asks the compiler to report which loops it offloaded and what data movement it generated.
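
Other toolchains use different flags; with GCC, for instance, the corresponding (illustrative) command would be:

bash
gcc -fopenacc -O3 -o saxpy saxpy.c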

On an HPC cluster, you typically:

  1. Load the appropriate compiler module (via environment modules).
  2. Compile with OpenACC flags.
  3. Submit a batch job that requests GPU resources, and run the generated executable on nodes with GPUs.

The details of job submission, module loading, and resource requests are covered in other chapters; here, the key idea is that OpenACC code is compiled into binaries that target the GPU, then run like any other HPC application on GPU-equipped nodes.

Common Pitfalls for Beginners

When first using OpenACC, common issues include:

  • Excessive data movement: without data regions or enter data/exit data, arrays are copied to and from the device around every compute region, which can erase any speedup.
  • Loop-carried dependencies: marking a loop parallel when its iterations are not actually independent gives wrong results, because the compiler trusts the directive (see the sketch after this list).
  • Missing or incorrect array bounds in data clauses, leading to partial copies or runtime errors.
  • Calling host-only functions inside compute regions; routines called on the device need #pragma acc routine.
  • "Not present" runtime errors, when a present clause refers to data that was never placed on the device.
  • Comparing results too strictly: parallel reductions and different hardware can change floating-point rounding slightly.
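
For instance, a running-sum loop like the sketch below carries a dependency between iterations and must not simply be annotated with parallel loop:

c
/* Each iteration reads the value written by the previous one, so the
   iterations are NOT independent. Annotating this loop with
   "#pragma acc parallel loop" as-is would silently produce wrong results. */
void running_sum(int n, float *a)
{
    for (int i = 1; i < n; i++) {
        a[i] = a[i] + a[i - 1];
    }
}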

Strategy for Incrementally Porting Code with OpenACC

A practical way to start using OpenACC on an existing CPU code:

  1. Profile on CPU to find the most time-consuming loops or regions.
  2. Add minimal directives to one hotspot:
    • A simple parallel loop with copy/copyin/copyout clauses on arrays.
  3. Verify correctness:
    • Check that results match CPU-only runs (within acceptable floating-point differences).
  4. Measure performance:
    • Ensure that you are gaining speedup compared to CPU runs (with data transfer overhead accounted for).
  5. Refine data movement:
    • Introduce data regions or enter data/exit data to reduce transfers.
  6. Iterate:
    • Move on to the next hotspot, keeping the structure readable and maintainable.

Over time, you can introduce more advanced features (asynchronous execution, finer control over gang/vector mapping, etc.) as needed, but the initial steps focus on clear, independent loops and explicit data clauses.

How OpenACC Compares to Other GPU Programming Models (Conceptual)

OpenACC sits on a spectrum of GPU programming approaches:

  • Low-level models such as CUDA and HIP give full control over kernels, memory, and scheduling, at the cost of more code and more device-specific detail.
  • Directive-based models such as OpenACC and OpenMP target offload keep the original source largely intact and let the compiler handle the low-level mapping.
  • GPU-enabled libraries and frameworks hide the device behind existing APIs, which works well when your computation matches what the library provides.

OpenACC’s niche:

  • Incremental, directive-based acceleration of existing C, C++, and Fortran codes.
  • Reasonable performance with comparatively little code change, at the price of less fine-grained control than CUDA or HIP.
  • A single annotated source that can still be compiled and run as a pure CPU program.

Knowing this context helps you understand when OpenACC is an appropriate tool: when you value portability, incremental porting, and maintainability, and do not need to micromanage every aspect of GPU execution from the start.
