
Introduction to OpenMP

Context and Purpose of OpenMP

OpenMP is a standard for writing shared memory parallel programs in C, C++, and Fortran. It is designed to make it easy to add multithreading to existing code without a complete rewrite. Instead of calling many explicit threading library functions, you keep your program mostly as is and insert OpenMP directives that tell the compiler where and how to run code in parallel.

In the shared memory model, all threads of a program can access the same variables in memory. OpenMP assumes this model and builds on it. As a result, OpenMP is particularly useful on single nodes of an HPC cluster, where multiple cores share the same physical memory.

The role of this chapter is to introduce OpenMP as a practical tool, show the basic syntax, and illustrate how to turn a simple serial program into a parallel one. More advanced aspects such as detailed thread management, work sharing constructs, and synchronization will be treated in later chapters of this section.

Programming Model and Basic Ideas

OpenMP follows a fork-join execution model. Your program begins as a single thread. At specific places that you mark, this thread forks a team of threads that execute a region of code in parallel. Once the parallel work finishes, the threads join back into one thread and execution continues serially.

In code, you do not create and destroy threads manually. Instead, you mark regions where OpenMP should create a team. The OpenMP runtime and the compiler handle the details of mapping threads to cores, scheduling, and synchronization.

Several concepts appear repeatedly throughout OpenMP programs:

A team is a group of threads that execute a parallel region together.

The master thread (called the primary thread in recent versions of the OpenMP specification) is the original thread that begins program execution and has thread number 0 within its team.

Each thread has a thread ID, an integer in the range from 0 to $N - 1$ if there are $N$ threads in the team.

OpenMP also distinguishes between variables that are shared among threads and variables that are private to each thread. Controlling the sharing attributes of variables is central to correct and efficient OpenMP programs, and is closely related to race conditions and synchronization, which you will see in later chapters.

Language Support and Basic Syntax

OpenMP is not a separate language. It is a set of extensions to C, C++, and Fortran, consisting of:

Compiler directives that you insert as pragmas in C and C++, and as special comments or directives in Fortran.

Runtime library routines that you can call to query and control the OpenMP environment, such as the number of threads.

Environment variables that influence execution without recompiling, for example to choose how many threads to use.

In C and C++, OpenMP directives use the #pragma mechanism. The general form is:

c
#pragma omp directive-name [clauses]
    structured-block

The structured-block is usually a single statement or a compound statement enclosed in braces. The compiler recognizes #pragma omp lines only when OpenMP is enabled during compilation.
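
As a small illustration, both forms below are valid structured blocks; do_work and more_work are hypothetical placeholder functions, not part of OpenMP:

c
void do_work(void);    /* hypothetical placeholders, declared elsewhere */
void more_work(void);

void example(void) {
    #pragma omp parallel
    do_work();               /* structured block: a single statement */

    #pragma omp parallel
    {                         /* structured block: a compound statement */
        do_work();
        more_work();
    }
}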

In Fortran, OpenMP uses a different syntax, but the ideas are the same. The precise Fortran forms vary between fixed and free form source and will be covered only briefly here.

Because OpenMP is implemented as compiler extensions, the compiler has to be told explicitly to enable it. This happens with specific flags for different compilers, as described later in this chapter.

Enabling OpenMP in Common Compilers

To use OpenMP, you must do two things: include the appropriate header or module in your code, and compile with OpenMP support turned on.

In C and C++, include the header:

c
#include <omp.h>

This gives access to the OpenMP library functions and constants.

In Fortran, you typically use one of:

fortran
use omp_lib
! or
include 'omp_lib.h'

To enable OpenMP during compilation, use the correct compiler option. The exact flag depends on the compiler family:

For GCC, including gcc and g++, use:

bash
gcc -fopenmp program.c -o program
g++ -fopenmp program.cpp -o program

For the Intel oneAPI compilers, use:

bash
icx -qopenmp program.c -o program
ifx -qopenmp program.f90 -o program

For LLVM clang, support depends on configuration, but a common form is:

bash
clang -fopenmp program.c -o program

Without these flags, the compiler ignores the #pragma omp directives, possibly with a warning about an unknown pragma. This gives OpenMP an important property: the code remains valid serial code. You can compile and run it on systems without OpenMP support, albeit without parallel execution.

Always compile with the OpenMP flag when you want parallel execution. Compiling without it will silently disable OpenMP directives and produce a serial program.
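
If you want a single source file to behave sensibly both with and without OpenMP, a common pattern is to guard the header and any OpenMP library calls with the standard _OPENMP macro. A minimal sketch:

c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>          /* only available when compiling with OpenMP */
#endif

int main(void) {
#ifdef _OPENMP
    printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
#else
    printf("Compiled without OpenMP, running serially\n");
#endif
    return 0;
}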

The First Parallel Region

The most fundamental OpenMP construct is the parallel directive. It marks a block of code that should run in multiple threads.

In C or C++, a minimal example looks like:

c
#include <stdio.h>
#include <omp.h>
int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", id, nthreads);
    }
    return 0;
}

If you compile and run this with OpenMP enabled, you should see multiple lines of output, one from each thread. The order of the lines can vary, because threads run concurrently.

The #pragma omp parallel line instructs the compiler to create a team of threads. Each thread then executes the following block. When all threads finish that block, they synchronize and the program continues after the block with a single thread.

Two basic library functions appear in this example:

omp_get_thread_num() returns the ID of the calling thread, an integer label starting at 0.

omp_get_num_threads() returns the total number of threads in the current team.

Inside the parallel region, each thread may execute the same code on different data. OpenMP provides ways to divide loop iterations among threads, to reduce values, and to control which variables are shared. These are part of the work-sharing constructs and synchronization topics in later chapters.

Controlling the Number of Threads

OpenMP allows you to specify how many threads to use for parallel regions. There are several ways to do this. You can call a runtime function in your program, you can set an environment variable, or you can use a clause on the parallel directive.

To set the number of threads in code, use:

c
#include <omp.h>
int main(void) {
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        /* parallel work */
    }
    return 0;
}

This instructs the runtime to create up to 4 threads in subsequent parallel regions, subject to any system limits.

You can also place a num_threads clause directly on a parallel region:

c
#pragma omp parallel num_threads(8)
{
    /* code here runs with 8 threads if possible */
}

In many HPC environments, it is more common to control the number of threads from the outside through the environment variable OMP_NUM_THREADS. For example, in a shell:

bash
export OMP_NUM_THREADS=16
./program

This approach is especially convenient on clusters, because you can adjust the thread count in the job script to match the cores allocated to your job, without recompiling.
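
Since there are three ways to request a thread count, it helps to know how they interact. In the usual precedence, a num_threads clause overrides omp_set_num_threads, which in turn overrides OMP_NUM_THREADS. A minimal sketch:

c
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Lowest priority: OMP_NUM_THREADS set in the environment (if any). */
    /* Next:            omp_set_num_threads() called in the program.     */
    omp_set_num_threads(4);

    /* Highest:         an explicit num_threads clause on the region.    */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            printf("This team has %d threads\n", omp_get_num_threads());
        }
    }
    return 0;
}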

The number of OpenMP threads should match the CPU cores assigned to your job on a node. Oversubscribing by using more threads than cores usually hurts performance and can create interference with other jobs.

The Basic Parallel For Loop

A common use of OpenMP is to parallelize loops that process many independent iterations. The for directive in C and C++, and the corresponding do directive in Fortran, splits the iterations of a loop among the threads in a team.

In C, the pattern is:

c
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}

This has two effects. First, a team of threads is created, as with parallel. Second, the total iteration space from 0 to N - 1 is partitioned into smaller chunks, and each thread executes a subset of the iterations.

In this form, parallel and for are combined into a single directive. For more complex cases, they can be separated and combined with other constructs, but for an introduction this combined form is often enough to see immediate speedups.
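
As a brief illustration of the separated form, the following sketch creates one team and then work-shares the loop inside it; the arrays a, b, c and the size N are assumed to be declared and initialized elsewhere:

c
/* Equivalent to the combined form, but with the team created once and
   the loop work-shared inside it. */
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
}   /* implicit barrier at the end of the loop and of the parallel region */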

Fortran has an analogous form:

fortran
!$omp parallel do
do i = 1, N
    a(i) = b(i) + c(i)
end do
!$omp end parallel do

Here, the source comment !$omp is interpreted as an OpenMP directive when compiled with OpenMP support. Otherwise, it is treated as a normal comment, which again preserves valid serial behavior.

For a loop to be safely parallelized in this way, iterations should be independent. That means computing a[i] must not depend on results from earlier iterations in a way that would break if iterations run out of order. Detecting such dependencies and handling them correctly is part of parallel programming practice and connects to race conditions and synchronization in later chapters.
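
To make the notion of a dependency concrete, here is a sketch of a loop that must not be parallelized with a plain parallel for, because each iteration reads the result of the previous one (a, b, and N assumed to exist):

c
/* NOT safe to parallelize as written: iteration i reads a[i - 1],
   which another thread may not have computed yet. */
for (int i = 1; i < N; i++) {
    a[i] = a[i - 1] + b[i];
}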

Variable Scoping: Shared and Private

In a parallel region, all threads see the same memory, but not every variable behaves the same way. Some variables are shared among all threads, others should be private to each thread. OpenMP allows you to control this as part of the directive.

By default, in C and C++, variables declared outside a parallel region and used inside it are shared among the threads, while variables declared inside the parallel region are private to each thread. This default can be changed with the default clause and overridden for individual variables with explicit scoping clauses.

You can specify scoping explicitly with clauses such as shared and private. For example:

c
int i;
double sum = 0.0;
#pragma omp parallel for private(i) shared(sum)
for (i = 0; i < N; i++) {
    /* use i privately, see sum as shared */
}

Here, each thread gets its own private copy of i, so the threads do not interfere with each other when updating the loop index. (The index variable of a work-shared loop is made private automatically, but stating it explicitly does no harm.) The variable sum is shared, so all threads refer to the same memory location.

Without careful control of shared and private variables, you can easily create race conditions, where the outcome depends on the timing of memory accesses in different threads. The concepts of race conditions and the synchronization constructs to avoid them form a separate topic, but it is important to recognize that variable scoping is a central part of OpenMP correctness.

Always reason about which variables must be shared and which must be private in a parallel region. Incorrect scoping is a common source of subtle and hard to debug errors in OpenMP programs.

Basic Reductions

Many parallel loops accumulate a result, such as a sum or a maximum. If a variable like sum is shared, and each thread executes sum += value;, then all threads modify the same memory location concurrently. This leads to a race condition and incorrect results.
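
As a sketch of the unsafe pattern just described (a and N assumed to exist):

c
/* WRONG: sum is shared and updated by all threads without any
   synchronization, so updates can be lost or interleaved badly. */
double sum = 0.0;
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    sum += a[i];   /* data race on sum */
}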

OpenMP provides a reduction clause that handles this pattern safely. With reduction, each thread gets a private copy of the variable, and at the end of the parallel region these private copies are combined using the specified operation.

For example, to compute a sum:

c
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += a[i];
}

The reduction(+:sum) clause states that sum should participate in a reduction with the + operator. Each thread computes a partial sum, and after the loop finishes, the runtime combines the partial sums into the final value of sum.

Other common operators include * for products, max and min for extrema, and logical operators for Boolean reductions. The details and performance implications of reductions are explored further when discussing work-sharing and performance considerations.
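
For example, finding the largest element of an array can be written with a max reduction, which has been available for C and C++ since OpenMP 3.1 (a and N assumed to exist):

c
/* Each thread tracks its own running maximum; the runtime combines the
   private copies into maxval after the loop. */
double maxval = a[0];
#pragma omp parallel for reduction(max:maxval)
for (int i = 1; i < N; i++) {
    if (a[i] > maxval) {
        maxval = a[i];
    }
}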

OpenMP Runtime Library Basics

Beyond directives, OpenMP defines a runtime library available through functions like omp_get_thread_num and omp_set_num_threads. These allow your program to query and control aspects of the parallel execution.

Some of the most commonly used functions in C and C++ include:

int omp_get_thread_num(void); returns the ID of the calling thread in the current team.

int omp_get_num_threads(void); returns the number of threads in the current team.

int omp_get_max_threads(void); returns the maximum number of threads that could be used in a parallel region.

void omp_set_num_threads(int nthreads); requests that future parallel regions use nthreads threads.

double omp_get_wtime(void); returns a wall clock time in seconds, which is useful for simple timing measurements.
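
One detail worth noting: omp_get_num_threads reports the size of the current team, so outside any parallel region it returns 1, while omp_get_max_threads reports how many threads a new region could use. A small sketch:

c
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* In the serial part the team consists of a single thread. */
    printf("Outside: %d thread(s), up to %d available\n",
           omp_get_num_threads(), omp_get_max_threads());

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            printf("Inside:  %d thread(s) in the team\n",
                   omp_get_num_threads());
        }
    }
    return 0;
}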

A typical timing example looks like:

c
double t0 = omp_get_wtime();
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    /* work */
}
double t1 = omp_get_wtime();
printf("Elapsed time: %f seconds\n", t1 - t0);

This approach gives a quick way to compare serial and parallel sections, and to check that enabling OpenMP actually reduces the runtime for sufficiently large problems.

Use omp_get_wtime for measuring OpenMP code sections, not the CPU time of a single thread. Wall clock timing better reflects the overall performance seen by your application.

Environment Variables and Execution Control

In addition to directives and library calls, OpenMP can be influenced by environment variables, which you set in the shell before running your program. These variables provide a convenient way to experiment with different configurations, especially on HPC systems where you may not want to modify the source code for every run.

Some key environment variables include:

OMP_NUM_THREADS sets the number of threads to use for parallel regions, as discussed earlier.

OMP_SCHEDULE influences how iterations are distributed among threads for certain loop schedules, which becomes relevant when you study work sharing and load balancing.

OMP_PROC_BIND and OMP_PLACES can control how threads are bound to cores, which affects performance and interaction with the operating system scheduler.

For example:

bash
export OMP_NUM_THREADS=8
export OMP_SCHEDULE="dynamic,4"
./program

would run the program with 8 threads and apply a dynamic schedule with a chunk size of 4 to loops that request the runtime schedule. Details of scheduling policies belong in later discussions of performance considerations, but it is useful to know that OpenMP offers this level of control without requiring recompilation.
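
A loop picks up OMP_SCHEDULE only if it uses the runtime schedule; a minimal sketch (a, b, c, and N assumed to be declared elsewhere):

c
/* The schedule is read from OMP_SCHEDULE when the loop runs; without the
   schedule(runtime) clause the environment variable has no effect here. */
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}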

Compilation, Portability, and Standard Versions

OpenMP is standardized by the OpenMP Architecture Review Board (ARB), an open consortium of compiler vendors and research organizations. The version of OpenMP supported by your compiler can affect which features are available. For instance, early versions focus on basic parallel regions and loops, while later versions add more complex task constructs and better support for accelerators and modern C++.

Different compilers and different versions of the same compiler might implement different OpenMP versions. You can usually find out which version is supported from the compiler documentation or by checking predefined macros in C and C++.
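
In C and C++, a conforming compiler defines the _OPENMP macro to the year and month of the supported specification (for example, 201511 corresponds to OpenMP 4.5), so a quick check looks like this:

c
#include <stdio.h>

int main(void) {
#ifdef _OPENMP
    /* _OPENMP expands to yyyymm of the supported OpenMP specification. */
    printf("OpenMP version macro: %d\n", _OPENMP);
#else
    printf("OpenMP not enabled for this compilation\n");
#endif
    return 0;
}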

One key property of OpenMP is that it is largely portable across platforms and compilers. A program that uses only basic constructs like parallel and parallel for will typically compile and run on any system with a conforming OpenMP-capable compiler.

Because OpenMP uses pragmas in C and C++ and special comments in Fortran, both of which are ignored by non-OpenMP compilers, the same source code can serve as both serial and parallel code.

Rely on standardized OpenMP features whenever possible. Vendor specific extensions can reduce portability and may behave differently across systems.

Putting It All Together: A Simple Example

To summarize, consider a simple OpenMP program that adds two vectors and computes the sum of the result, using parallel loops and a reduction.

c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char *argv[]) {
    int N = 1000000;
    if (argc > 1) {
        N = atoi(argv[1]);
    }
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += c[i];
    }
    double t1 = omp_get_wtime();
    printf("Sum = %f\n", sum);
    printf("Elapsed time = %f seconds\n", t1 - t0);
    free(a);
    free(b);
    free(c);
    return 0;
}

This program illustrates several core OpenMP ideas:

Use of #include <omp.h> and -fopenmp or equivalent to enable OpenMP.

A parallel loop that divides work across threads.

A reduction that safely combines partial results.

Simple timing with omp_get_wtime.

In practice, you would compare the runtime with and without OpenMP, and experiment with OMP_NUM_THREADS to see how performance scales with the number of cores. As you proceed through the following chapters, you will learn how to control threads more precisely, how work is shared, how to avoid race conditions, and how to tune OpenMP programs for high performance on real HPC systems.
