
GCC

Overview

GCC, the GNU Compiler Collection, is one of the most widely used compilers in HPC. It is free, open source, and available on essentially every Linux based system. In HPC environments, GCC is often the default system compiler and also one of several compiler families offered through environment modules. This chapter focuses on what is specific to using GCC in high performance contexts, how it relates to standards, how to control it from the command line, and which options are particularly relevant for performance oriented work.

Language support and standards

GCC supports multiple programming languages that are central to HPC, in particular C, C++, and Fortran. Each language front end has its own compiler driver, usually invoked as gcc for C, g++ for C++, and gfortran for Fortran. These all share the same back end, which performs optimizations and code generation.

For HPC, standards conformance matters because scientific codes are often long lived and depend on specific language features. GCC tracks modern language standards quite closely. With appropriate flags you can select the desired standard level, for example -std=c11 or -std=c17 for C, -std=c++14, -std=c++17, or -std=c++20 for C++, and the corresponding Fortran standards such as -std=f2008 with gfortran. Older codes that predate strict standardization often rely on GNU extensions, so it is common to use -std=gnu11 or similar in C, which keeps extensions enabled.
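As a small illustration, the standard can be selected explicitly on each driver (the source file names are placeholders):

gcc -std=c17 main.c -o mycode
g++ -std=c++17 main.cpp -o mycode
gfortran -std=f2008 main.f90 -o mycode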

In an HPC setting, you often need to combine codes from different eras. It is therefore important to know which language standard the code expects and to make GCC respect that choice consistently across all compilation units and libraries.

Rule: Always use an explicit -std= option with GCC for production builds of HPC codes, and keep this choice consistent across all source files and libraries.

GCC invocation and basic workflow

At its core, GCC follows the typical compile then link workflow. For C you might write:

gcc -c main.c -o main.o
gcc -c solver.c -o solver.o
gcc main.o solver.o -o mycode

The -c flag stops after compilation and produces object files. Omitting -c makes GCC perform the final link step, invoking the system linker with a set of default libraries. The same pattern applies to g++ and gfortran.

In HPC you will often compile and link separately for several reasons. Separate compilation improves incremental builds, permits different options per file, and is a prerequisite for some profiling and instrumentation tools. It also makes it easier to include object files from other compilers or precompiled vendor libraries. Build systems such as Make or CMake will use these same command line interfaces behind the scenes.
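As a sketch of why per file options matter, a hot computational kernel can be compiled more aggressively than the rest of the code (the file names and the extra flag are illustrative):

gcc -O2 -c main.c -o main.o
gcc -O3 -funroll-loops -c kernel.c -o kernel.o
gcc main.o kernel.o -o mycode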

On many clusters, the compiler commands you see first in your shell correspond to system wide defaults. HPC centers typically provide additional GCC versions through modules with names such as gcc/11.3.0 or gcc/12.2.0. Loading such a module adjusts your PATH and library search paths so that the selected GCC version and its associated runtime libraries are used by default.
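A typical session might therefore look like this (the exact module names differ between sites):

module load gcc/12.2.0
gcc --version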

Optimization levels in GCC

GCC offers several levels of generic optimization, each activated by an -O flag. These levels trade compile time against runtime performance and are fundamental to performance tuning in HPC.

The main levels are:

-O0 disables most optimizations and is primarily useful for debugging. It makes compilation fast, but the resulting executables are often much slower and larger. Variables tend to map closely to source code, which helps debuggers.

-O1 turns on a basic set of optimizations without greatly increasing compile time. It is a reasonable default for development when you do not want debugging to be too hard but also do not want extremely slow code.

-O2 enables a broad range of optimizations and is a common default for production builds. It attempts to significantly improve performance while keeping compile times acceptable for everyday use.

-O3 enables more aggressive optimizations, including those that can change code structure to increase instruction level and loop level parallelism. This often helps numerical kernels, but can occasionally lead to larger binaries, higher memory use, or even performance regressions in some parts of the code.

-Ofast adds to -O3 a set of options that relax strict language rules, in particular with respect to floating point behavior and standards compliance. It can provide higher performance at the cost of potentially different numerical results.
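A simple way to compare levels is to build the same code twice, changing nothing but the optimization flag (the file names are placeholders):

gcc -O2 main.c solver.c -o mycode_O2
gcc -O3 main.c solver.c -o mycode_O3

Timing both binaries on the same input, and checking that the results still agree, shows directly whether the more aggressive level pays off.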

Rule: Use at least -O2 for performance measurements, and test numerical sensitivity carefully before adopting -O3 or -Ofast in production HPC runs.

The choice between these levels depends on your code characteristics. Dense linear algebra or stencil codes may profit significantly from -O3 and -Ofast. Irregular codes or those that depend on subtle floating point properties may behave differently or even incorrectly if fast math assumptions are enabled. Validation and regression tests are essential whenever you change optimization levels.

Architecture specific tuning

Generic optimizations are not always enough for HPC workloads. GCC can tailor generated machine code to a specific CPU architecture or microarchitecture through flags like -march and -mtune.

The -march= option tells GCC which instruction set and architectural features it is allowed to use. For example, -march=native instructs GCC to detect the architecture of the build machine and generate code that uses all instructions and features supported there. On an x86_64 platform this can include AVX2 or AVX-512 instructions for vector operations.

The -mtune= option focuses on scheduling and microarchitectural tuning, while preserving compatibility with older instruction sets. For instance, you might compile with -march=x86-64 -mtune=skylake to keep code runnable on any x86_64 CPU, but tuned for a Skylake generation core.

On shared clusters, -march=native can be risky. If you compile on a newer node and run on an older one, the binary may use instructions that are not available on the target node, leading to illegal instruction errors. Many HPC centers specify a recommended -march or provide architecture specific modules such as cpu/rome or cpu/icelake that document the available instruction sets.

A typical performance oriented GCC invocation might look like:

gcc -O3 -march=native -ffast-math -fopenmp mycode.c -o mycode

This combination is very aggressive and should only be used after careful testing.

Rule: Match -march to the slowest CPU model you expect to run on, unless you deliberately build node specific executables for a homogeneous partition.

Vectorization with GCC

Automatic vectorization is crucial for high performance, since it allows the compiler to exploit SIMD units without manual intrinsics. In GCC, auto vectorization is enabled by default at -O3, and in recent releases (GCC 12 and later) also at -O2, where a more conservative cost model is used.

You can explicitly enable vectorization with -ftree-vectorize and influence its aggressiveness, for example with -fno-tree-vectorize to disable it or -fvect-cost-model=unlimited to lift cost model restrictions and vectorize more speculatively.

The effectiveness of vectorization depends strongly on how loops are written. GCC must be able to prove that iterations are independent and that there are no harmful pointer aliasing relationships. The restrict keyword in C, and certain compiler hints or pragmas, can help GCC understand that pointers do not overlap. For Fortran, the language semantics already give strong aliasing guarantees, which is one reason Fortran codes often vectorize well.
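The following minimal C sketch shows the pattern, with an illustrative function name; restrict tells GCC that the two arrays cannot overlap, so the iterations are provably independent:

void axpy(int n, double a, const double *restrict x, double *restrict y)
{
    /* independent iterations: a candidate for SIMD code generation */
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}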

To check what GCC is doing, you can ask for vectorization reports. Flags like -fopt-info-vec and -fopt-info-vec-optimized cause GCC to print information about which loops were vectorized and why. This is a practical starting point for loop level performance tuning.
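For example, a report enabled compilation of a hypothetical kernel.c could look like:

gcc -O3 -march=native -fopt-info-vec-optimized -c kernel.c

GCC then prints one line per loop it managed to vectorize, including the source location.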

Rule: Always inspect GCC vectorization reports when optimizing numerical kernels. They reveal missed opportunities and reasons, such as potential aliasing or unknown loop bounds.

OpenMP and multithreading support

For shared memory parallelism, GCC supports the OpenMP standard, which is the primary thread based model in many HPC codes. You enable OpenMP support using -fopenmp with gcc, g++, or gfortran. This option does two things at once. It instructs the compiler to recognize and translate OpenMP directives, and it links against the appropriate OpenMP runtime library.

When you compile with -fopenmp, GCC also defines some preprocessor macros, which you can use in your source code to conditionally include OpenMP specific parts. For example, in C you can test #ifdef _OPENMP to only compile parallel regions when OpenMP is available.
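A minimal sketch of this pattern in C (the printed message is only illustrative):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
    #pragma omp parallel
    {
    #ifdef _OPENMP
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    #else
        printf("compiled without OpenMP\n");
    #endif
    }
    return 0;
}

Built with gcc -fopenmp example.c, the program prints one line per thread; built without -fopenmp, the directive is ignored and the serial branch is used.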

The OpenMP version supported depends on the GCC version. Newer GCC releases gradually adopt more recent OpenMP specifications. In practice, you may need a sufficiently recent GCC version to make use of advanced OpenMP features. HPC centers often document which OpenMP level each compiler module supports.

Because multithreaded execution interacts with optimization, you should use the same -O level and -fopenmp consistently for all compilation units that participate in OpenMP regions. Mixing different levels can sometimes lead to unpredictable performance or behavior, especially if inlining and interprocedural optimizations are used.

Debugging with GCC specific options

While another chapter focuses on debugging tools, it is useful to understand the GCC options that make debugging HPC codes practical. The primary flag is -g, which adds debug information to the generated binaries. This allows debuggers to map machine code back to source lines and variables.

For debug builds, a standard combination is -O0 -g. However, this can make certain HPC codes run too slowly to be practical, especially parallel ones. GCC supports -Og, a level that enables optimizations that do not interfere too heavily with debugging. Combining -Og -g often gives a good balance.

HPC numerical codes also often suffer from subtle issues such as uninitialized variables. GCC can help detect these at compile time and runtime through warnings and sanitizers. Enabling extra warnings with -Wall -Wextra is a good start. Runtime sanitizers like -fsanitize=address or -fsanitize=undefined can be powerful for catching memory errors and undefined behavior, though they increase runtime overhead substantially.
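A debug oriented build that combines these options might look like this (file names are illustrative):

gcc -Og -g -Wall -Wextra -fsanitize=address,undefined main.c solver.c -o mycode_debug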

There is an important trade off between aggressive optimizations and debuggability. Higher optimization levels, especially with -Ofast, can reorder operations, remove variables, and inline functions deeply. This makes debugging more difficult because observed behavior may not follow the source code structure closely. For this reason, it is common to maintain separate debug and release configurations in an HPC project, even when using build systems.

Linking libraries with GCC

Most HPC applications depend on external libraries, such as numerical libraries or MPI implementations. With GCC, you control linking through a combination of -L to specify library search paths and -l to select specific libraries by name.

For example, to link against a BLAS library located in /opt/blas/lib, you might write:

gcc main.o solver.o -L/opt/blas/lib -lblas -o mycode

The order of libraries on the command line matters. With the traditional Unix linker that GCC calls, a library can only resolve symbols that have been referenced by object files or previous libraries in the same command line. In HPC builds with many dependencies, incorrect ordering can cause unresolved reference errors.
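For example, if solver.o calls LAPACK and LAPACK in turn calls BLAS, the library that depends on another must appear before the one it depends on (the path is the same illustrative one as above):

gcc main.o solver.o -L/opt/blas/lib -llapack -lblas -o mycode

Swapping -llapack and -lblas can leave the BLAS symbols used inside LAPACK unresolved, particularly when static libraries are involved.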

HPC clusters often provide wrapper compilers such as mpicc, mpicxx, or mpifort, which internally call GCC or another compiler. These wrappers automatically add the correct MPI include paths and libraries. When you use GCC directly, you must add the corresponding -I, -L, and -l options yourself. It is usually simpler and more robust to use the MPI wrapper compilers with GCC as the underlying compiler, especially in multi node codes.
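With the wrappers, a minimal MPI build reduces to something like:

mpicc -O2 -c main.c -o main.o
mpicc main.o -o mycode

Most MPI implementations can also print the underlying command line, for example mpicc -show with MPICH or mpicc --showme with Open MPI, which is a convenient way to confirm that GCC is the compiler actually being invoked.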

For C++, use g++ rather than gcc when linking, even if all your object files are already compiled, because g++ automatically links against the C++ standard library. Similarly, gfortran adds the Fortran runtime libraries.
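Concretely, the link step of a C++ application would be:

g++ main.o solver.o -o mycode

even when every object file was already compiled in a separate step.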

Rule: Always use the language specific GCC driver (g++, gfortran) for linking C++ or Fortran HPC applications, so that the correct runtime libraries are included automatically.

GCC versions on clusters

Different GCC versions can generate binaries with different performance characteristics and support different language and OpenMP features. On HPC systems, it is common to have multiple GCC versions side by side. These are usually provided via environment modules, and each module is often used as the basis for a matching software stack.

From a user perspective, this means two things. First, you should choose a GCC version that is compatible with the libraries and tools you plan to use. Second, you should document which version you used, so that results are reproducible and colleagues can rebuild the same code.

Because GCC evolves, performance can vary significantly across versions even with the same flags. For critical kernels, it is sometimes worth benchmarking with several GCC releases that are available on your system. Some HPC centers also provide non GNU compilers, such as Intel or LLVM based ones, and you may compare these as well. The key is to keep the rest of the configuration constant while you change only the compiler.

Position of GCC within HPC toolchains

In typical HPC environments, GCC plays several roles at once. It is a general purpose compiler for user codes written in C, C++, or Fortran. It is also the base compiler used to build many open source libraries and frameworks, from MPI implementations to numerical libraries and analysis tools.

Vendor optimized libraries are often compatible with GCC compiled applications, but compatibility can depend on ABI details such as the C++ standard library version. When mixing GCC compiled code with third party libraries, especially C++ ones, you must ensure that the ABI and standard library versions match or are known to interoperate. For pure C or Fortran libraries, this is usually less of a problem.

Because of its wide adoption, knowledge of GCC flags and behavior is transferable between systems and projects. Once you are familiar with tuning GCC on one cluster, you can often carry that experience over to others, adjusting only for differences in CPU architecture and library availability.
