What OpenMP Is (and Isn’t)
OpenMP is a specification for shared‑memory parallel programming, mainly for C, C++, and Fortran. It lets you express parallelism using:
- Compiler directives (pragmas in C/C++, comments in Fortran)
- A small runtime library
- A few environment variables
Key characteristics:
- Shared-memory model: multiple threads share the same address space.
- Incremental parallelism: you can parallelize code by adding directives to existing loops and regions without rewriting logic.
- Compiler-based: the compiler interprets directives such as `#pragma omp parallel` and generates threaded code.
OpenMP is not:
- A separate programming language.
- A distributed-memory (multi-node) model (that is covered by MPI).
- Tied to a specific OS or vendor—it's a portable standard, but you must compile with OpenMP support enabled.
Basic Compilation and Enabling OpenMP
To use OpenMP, you need:
- A compiler that supports OpenMP (GCC, Clang/LLVM, Intel, etc.).
- A compile-time flag to enable it.
Typical examples:
- GCC / Clang (C/C++): `-fopenmp`
- GCC / GFortran (Fortran): `-fopenmp`
- Intel (classic ICC/IFORT or oneAPI ICX/IFX): `-qopenmp` (or `-fiopenmp`, depending on the version)
Example (C):
```bash
gcc -O2 -fopenmp -o myprog myprog.c
```
If you omit the OpenMP flag, most compilers will ignore the directives and produce a normal serial executable.
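If you want a source file to build both with and without OpenMP, the standard `_OPENMP` macro (defined by the compiler whenever OpenMP support is enabled) can guard OpenMP-specific code. A minimal sketch, assuming a hypothetical file `check_openmp.c`:
```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>   /* only available when compiling with -fopenmp (or equivalent) */
#endif

int main(void) {
#ifdef _OPENMP
    /* _OPENMP expands to the year/month of the supported OpenMP specification */
    printf("Compiled with OpenMP support (_OPENMP = %d)\n", _OPENMP);
#else
    printf("Compiled without OpenMP support; directives are ignored\n");
#endif
    return 0;
}
```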
Basic OpenMP Syntax and Structure
Directives in C/C++
OpenMP directives in C/C++ use #pragma:
- General form: `#pragma omp <construct> [clauses]`
Examples:
- `#pragma omp parallel`
- `#pragma omp for`
- `#pragma omp parallel for`
- Clauses such as `num_threads`, `private`, `shared`, `reduction`, etc.
The entire directive must appear on a single logical line, and it applies to the succeeding structured block (for example, the next statement or loop).
Directives in Fortran (brief orientation)
In Fortran, directives are special comments, e.g.:
- `!$omp parallel`
- `!$omp do`
- Closed with `!$omp end parallel`, `!$omp end do`, etc.
The syntax details and style are Fortran-specific, but the conceptual constructs mirror those of C/C++.
Your First OpenMP Program
Below is a minimal C example showing a “Hello, world” using OpenMP:
```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("Hello from thread %d\n", tid);
    }
    return 0;
}
```
Key points:
- `#pragma omp parallel` creates a team of threads.
- The code block following the pragma is executed by each thread.
- `omp_get_thread_num()` returns the thread ID (0, 1, 2, …).
Compile with:
```bash
gcc -O2 -fopenmp hello_omp.c -o hello_omp
```
Run:
```bash
./hello_omp
```
You should see multiple lines, one per thread. The number of threads is controlled by runtime settings (see below).
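With four threads, the output might look like this (the exact order varies between runs because the threads execute concurrently):
```
Hello from thread 2
Hello from thread 0
Hello from thread 3
Hello from thread 1
```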
Controlling the Number of Threads
OpenMP lets you specify the team size at runtime or via code. Common mechanisms:
Environment variable: `OMP_NUM_THREADS`
Set before running your program:
```bash
export OMP_NUM_THREADS=4
./hello_omp
```
On typical HPC systems, you will often choose this to match the number of physical cores you want to use on a node.
In-code control: `num_threads` clause
You can override the environment variable in specific parallel regions:
```c
#pragma omp parallel num_threads(8)
{
    // this region uses 8 threads
}
```
Runtime function calls
You can query and set the thread count programmatically:
```c
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);   // request 4 threads for subsequent regions
    #pragma omp parallel
    {
        int tid  = omp_get_thread_num();
        int nthr = omp_get_num_threads();
        // ...
    }
    return 0;
}
```
On HPC clusters, the number of threads is often coordinated with the job scheduler and the number of cores you requested (discussed elsewhere).
OpenMP Execution Model: Teams and Threads
Basic model:
- The program starts with a single thread (the master thread).
- When the master thread encounters a `parallel` construct, it spawns a team of threads.
- All threads in the team execute the code inside the `parallel` region.
- At the end of the region, the threads synchronize, the team is disbanded, and the master thread continues sequentially.
Conceptually:
- Fork: Entering `#pragma omp parallel` → threads are created.
- Join: Exiting `#pragma omp parallel` → threads synchronize and join.
Inside a parallel region:
- You can get the thread ID with `omp_get_thread_num()`.
- You can get the team size with `omp_get_num_threads()`.
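To make the fork-join model concrete, here is a small sketch (compiled with an OpenMP flag such as `-fopenmp`) that runs serially, forks a team, and continues serially after the join:
```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("Serial part: only the master thread runs here\n");

    #pragma omp parallel                   /* fork: a team of threads is created */
    {
        int tid  = omp_get_thread_num();   /* this thread's ID */
        int nthr = omp_get_num_threads();  /* size of the team */
        printf("Parallel part: thread %d of %d\n", tid, nthr);
    }                                      /* join: threads synchronize here */

    printf("Serial part again: only the master thread continues\n");
    return 0;
}
```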
Basic Data-Sharing Concepts (OpenMP View)
OpenMP builds on the shared-memory model: all threads see the same address space, but variables can be:
- Shared: all threads see the same variable (default for most global/static variables, and heap allocations).
- Private: each thread gets its own copy (e.g., loop indices, temporary variables).
OpenMP lets you control this via clauses:
- `shared(list)`
- `private(list)`
- `firstprivate(list)`
- `lastprivate(list)`
- `reduction(op:list)`
Example skeleton (details of clauses are used more extensively in later sections):
```c
#pragma omp parallel shared(a, b) private(i, tmp)
{
    // 'a' and 'b' are visible and shared by all threads
    // each thread has its own 'i' and 'tmp'
}
```
For simple “hello world” or basic loops, default behavior often works, but correct data scoping becomes crucial as programs become more complex.
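As a small illustration of the difference these clauses make, the following sketch (the variable name `x` is purely illustrative) contrasts `private` and `firstprivate`:
```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 42;

    // private: each thread gets its own, uninitialized copy of x
    #pragma omp parallel private(x) num_threads(2)
    {
        x = omp_get_thread_num();          // must assign before use
        printf("private copy in thread %d: x = %d\n", omp_get_thread_num(), x);
    }

    // firstprivate: each thread's copy starts with the value from before the region (42)
    #pragma omp parallel firstprivate(x) num_threads(2)
    {
        printf("firstprivate copy in thread %d starts at x = %d\n",
               omp_get_thread_num(), x);
    }

    printf("After both regions the original x is still %d\n", x);
    return 0;
}
```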
Parallelizing Loops with OpenMP
One of OpenMP’s main strengths is easy parallelization of regular loops over independent iterations.
The `parallel for` construct (C/C++)
A common pattern:
```c
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}
```
Meaning:
- OpenMP creates a team of threads.
- Iterations of the loop are divided among threads.
- Each thread executes its assigned iterations.
Requirements for correctness:
- Each iteration must be independent (no data dependencies between different `i` values that would require ordering); a counter-example is sketched below.
Equivalent two-step form:
```c
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
}
```
This makes it clear that `parallel` creates threads and `for` distributes iterations. In many cases, `parallel for` is a convenient shorthand.
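By contrast, a prefix-sum style loop like the following is not safe to parallelize with `parallel for`, because each iteration reads the result of the previous one (the array names are purely illustrative):
```c
// NOT safe to parallelize: iteration i depends on the result of iteration i-1
for (int i = 1; i < N; i++) {
    a[i] = a[i - 1] + b[i];
}
```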
Loop scheduling (introductory view)
OpenMP supports different schedules for assigning loop iterations to threads via the `schedule` clause:
- `schedule(static)`
- `schedule(dynamic)`
- `schedule(guided)`
Example:
```c
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
    // ...
}
```
The choice of schedule affects load balancing and performance but does not change the loop’s final results if the loop is correctly parallelizable.
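For loops whose iterations have uneven cost, a dynamic schedule often balances the load better. A minimal sketch (the chunk size of 4 and the function `expensive_work()` are hypothetical):
```c
// iterations are handed out in chunks of 4 as threads become free
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++) {
    result[i] = expensive_work(i);   // per-iteration cost may vary widely
}
```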
Reductions: Parallel Accumulations
Many loops perform accumulations (sum, min, max, etc.). OpenMP provides the reduction clause so that each thread safely contributes to a shared result.
Example: summation
```c
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += a[i];
}
```
What happens:
- Each thread maintains a private partial sum.
- After the loop, OpenMP combines all partial sums with the `+` operator.
- The final `sum` is available in the master thread.
Supported operators commonly include `+`, `*`, `-`, `&`, `|`, `^`, `&&`, `||`, and others depending on the type.
This is the preferred way to handle patterns like sums and dot products in OpenMP.
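For instance, a dot product follows the same pattern; a short sketch, assuming `x` and `y` are double arrays of length `N`:
```c
double dot = 0.0;
#pragma omp parallel for reduction(+:dot)
for (int i = 0; i < N; i++) {
    dot += x[i] * y[i];   // each thread accumulates a private partial dot product
}
// after the loop, 'dot' holds the combined result
```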
Basic OpenMP Runtime Library Routines
OpenMP provides a small, portable runtime API. The most commonly used routines for beginners are:
- `int omp_get_num_threads(void)`: Number of threads in the current team.
- `int omp_get_thread_num(void)`: ID of the calling thread (0 to `num_threads - 1`).
- `void omp_set_num_threads(int nthreads)`: Request a number of threads for subsequent parallel regions.
- `int omp_get_max_threads(void)`: Maximum number of threads that could be used in a parallel region.
- `double omp_get_wtime(void)`: Wall-clock timer useful for timing code sections.
Example timing code:
```c
double t0 = omp_get_wtime();
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    // work
}
double t1 = omp_get_wtime();
printf("Elapsed time: %f seconds\n", t1 - t0);
```
Note: `omp_get_wtime()` measures wall-clock (elapsed) time and works the same way across platforms.
Environment Variables for Runtime Control
OpenMP behavior can be tuned without recompiling, using environment variables:
- `OMP_NUM_THREADS`: Controls the number of threads in parallel regions (if not overridden in code).
- `OMP_DYNAMIC`: Allows the runtime to adjust the number of threads (`TRUE`/`FALSE`).
- `OMP_PROC_BIND`: Controls thread binding to cores (affects affinity and performance).
- `OMP_PLACES`: Specifies which hardware resources (cores, sockets, etc.) threads can run on.
- `OMP_SCHEDULE`: Sets the default loop schedule for `for` loops using `schedule(runtime)`.
Examples:
```bash
export OMP_NUM_THREADS=8
export OMP_SCHEDULE="dynamic,4"
./myprog
```
On HPC systems, these variables are often combined with scheduler settings (e.g., number of cores per task) to achieve good performance.
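For `OMP_SCHEDULE` to take effect, the loop must use the `runtime` schedule; a brief sketch of the corresponding code:
```c
// the actual schedule (e.g. "dynamic,4") is read from OMP_SCHEDULE at run time
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < N; i++) {
    // ...
}
```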
Typical Usage Patterns in HPC Codes
Within HPC applications, OpenMP is frequently used for:
- Loop-level parallelism: Parallelizing compute-intensive loops in numerical kernels (matrix operations, stencil computations, etc.).
- Region-level parallelism: Wrapping larger blocks of code where tasks can be performed concurrently.
- Hybrid programming: Combining OpenMP (within a node) with MPI (across nodes), where each MPI process runs multiple OpenMP threads.
Conceptually common patterns:
- MPI rank per node (or per socket) + OpenMP threads per rank to use all cores.
- OpenMP around compute kernels, leaving I/O and communication mostly serial.
The details of hybrid strategies are covered elsewhere; here the key idea is that OpenMP targets intra-node, shared-memory parallelism.
Minimal Debugging and Safety Tips for Beginners
When starting with OpenMP:
- Begin from a working serial code.
- Parallelize one loop or region at a time.
- Verify correctness after each change before optimizing performance.
Some early safeguards:
- Use the `reduction` clause instead of manually updating shared accumulators (see the sketch after this list).
- Make loop indices `private` (OpenMP often does this automatically, but being explicit can be clearer).
- Avoid writing to the same array element from multiple iterations unless you understand the implications.
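For example, a minimal sketch (assuming a double array `a` of length `N`) contrasting an unsafe shared update with the `reduction` form covered earlier:
```c
double sum = 0.0;

// WRONG: all threads update the shared variable 'sum' concurrently (a data race)
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    sum += a[i];
}

// CORRECT: each thread accumulates privately and OpenMP combines the results
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += a[i];
}
```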
For initial experiments, you can add temporary checks inside parallel regions:
```c
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    printf("Thread %d is running here\n", tid);
}
```
Be aware that `printf` from multiple threads can interleave output, but this helps verify that threads are created and running.
Summary
In this chapter you learned:
- OpenMP is a directive-based shared-memory parallel programming model for C, C++, and Fortran.
- You enable it at compile time (e.g., with `-fopenmp`) and use pragmas such as `#pragma omp parallel` and `#pragma omp parallel for`.
- Threads share memory, but OpenMP provides mechanisms to control data sharing and safely perform reductions.
- Runtime behavior (number of threads, scheduling, affinity) can be adjusted with environment variables and API calls.
- OpenMP is especially suited to loop-based parallelism on multicore CPUs, and it is a foundational tool for node-level parallelism in HPC.