What “optimized build” means in HPC
An optimized build is a version of your program compiled with flags and settings that aim for maximum performance on a target machine, often at the cost of:
- Longer compile times
- Larger binaries
- Less helpful debugging information
- Reduced portability (tuned for a specific CPU/GPU)
In HPC, “optimized” usually means:
- Enabling high levels of compiler optimization (-O2, -O3, etc.)
- Turning on vectorization and architecture-specific tuning
- Choosing math library and floating‑point trade‑offs consciously
- Using link‑time optimization and profile‑guided optimization when useful
This chapter focuses on how to build such binaries, not on all the underlying performance theory (covered in performance-related chapters).
Typical optimization levels
Most compilers follow a similar hierarchy; details differ by compiler, but conceptually:
- -O0: No optimization. Direct translation of the code; fastest compilation. Use for debugging only.
- -O1: Light optimizations; generally safe and still relatively quick to compile.
- -O2: The “workhorse” optimization level. Most HPC codes use at least this. Good balance of runtime speed and compile time.
- -O3: Aggressive optimizations that may significantly restructure the code. Often faster, but not always; it may increase memory usage or cause performance regressions in some cases.
- -Ofast (compiler-specific): Like -O3 plus unsafe math/standards relaxations (e.g., ignoring strict IEEE/ISO semantics). May break strict numerical reproducibility or correctness in edge cases.
On GCC/Clang, for example, -O2 is usually a safe default for production; -O3 and -Ofast should be tested carefully against both correctness and performance.
On Intel and other vendor compilers the names differ, but the ideas are similar (e.g., Intel's -O2, -O3, and -fast).
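The practical impact of a level change is easy to measure yourself. A minimal sketch, assuming a single-file test program (miniapp.c and its input file are placeholder names):

```sh
# Build the same source at two optimization levels and compare timings.
gcc -O2 -o miniapp_O2 miniapp.c -lm
gcc -O3 -o miniapp_O3 miniapp.c -lm
time ./miniapp_O2 bench_input.dat
time ./miniapp_O3 bench_input.dat   # -O3 is not guaranteed to win; measure, don't assume
```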
Architecture-specific tuning
HPC systems often have specific CPU microarchitectures. Compilers can generate code tuned for them.
Typical GCC/Clang flags:
- -march=native: Generate instructions for the CPU you are compiling on. Great for code built directly on the target nodes, but may break portability to older CPUs.
- -mtune=native: Optimize for performance on the local CPU while still generating a more compatible instruction set (depends on architecture).
On clusters, you often need to:
- Know the node microarchitecture (e.g., Intel Skylake, AMD Zen 3).
- Use something like -march=skylake-avx512 or -march=znver3 (if allowed).
Important considerations:
- Portability vs. performance: a binary built with -march=native on a new CPU might not run on older nodes.
- Module systems: some toolchains provided by modules are already tuned for the cluster CPU, reducing your need to specify -march explicitly.
As a rule, cluster-wide production codes either:
- Are built with conservative but still tuned flags that work on all nodes, or
- Are built per-partition/architecture with different flags and stored separately.
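To find out what -march=native actually resolves to on a given node, you can ask GCC directly. A small sketch (the grep patterns are illustrative):

```sh
# Identify the CPU, then see which concrete -march GCC infers for it.
lscpu | grep 'Model name'
gcc -march=native -Q --help=target | grep -- '-march='
```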
Vectorization and SIMD flags
Vectorization is crucial in HPC. Build-time options often control how aggressively the compiler vectorizes loops and which instruction sets it may use.
Common GCC/Clang examples:
- -ftree-vectorize (usually included in -O3)
- -fopt-info-vec or -fopt-info-vec-missed to get reports on (missed) vectorization
- -mavx2, -mavx512f, etc., to allow specific SIMD instruction sets
Vendor compilers expose similar controls (e.g., Intel -xHost, -xCORE-AVX2, -xCORE-AVX512).
Practices for optimized builds:
- Enable vectorization (typically via -O2/-O3).
- Check compiler reports to ensure key loops are vectorized.
- Use architecture flags to allow advanced SIMD if your hardware and deployment model permit it.
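A quick way to audit vectorization with GCC, sketched below (kernels.c is a placeholder for a hot source file):

```sh
# Report which loops GCC vectorized, and which it could not (and why).
gcc -O3 -march=native -fopt-info-vec        -c kernels.c
gcc -O3 -march=native -fopt-info-vec-missed -c kernels.c
```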
Floating-point behavior and “fast math”
Optimized builds often adjust floating-point settings:
Examples (GCC/Clang):
- -ffast-math: Enables multiple unsafe floating-point optimizations (e.g., reassociation, assuming no NaNs/infinities).
- -fno-math-errno, -funsafe-math-optimizations, -fno-trapping-math: Finer-grained controls, often implicitly enabled by -ffast-math.
Intel and other compilers have similar flags, e.g., -fimf-precision=low or model-specific fast-math options.
Trade-offs:
- Pros:
  - Often substantial speedup in math-heavy code (linear algebra, PDE solvers, etc.).
- Cons:
  - Results may differ from strict IEEE behavior.
  - Runs might not be bitwise reproducible across compilers or even optimization levels.
  - May expose numerical instabilities in poorly conditioned algorithms.
In an optimized HPC build:
- Decide ahead of time if fast math is acceptable for your domain.
- Validate scientific results after enabling these options.
- Document which math/precision flags you used (important for reproducibility).
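A simple validation sketch, assuming a small numeric test program (sum.c is a placeholder):

```sh
# Build with and without fast math, then compare the outputs.
gcc -O2             -o sum_strict sum.c -lm
gcc -O2 -ffast-math -o sum_fast   sum.c -lm
./sum_strict > strict.out
./sum_fast   > fast.out
diff strict.out fast.out   # any difference means fast math altered the numerics
```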
Link-time optimization (LTO)
Link-time optimization allows the compiler to optimize across translation units (i.e., across separate .c/.cpp files) at the link stage.
Typical GCC/Clang flags:
- Compile: -flto
- Link: -flto (and use a compatible linker)
Benefits:
- Inlining and constant propagation across files.
- Potentially smaller and faster binaries.
Costs:
- Longer build times.
- More complex toolchain requirements (linker, ar, ranlib must support LTO).
- Some profilers or debuggers may be less friendly to heavily LTO-optimized binaries.
Usage pattern in an optimized build:
- Add -flto to CFLAGS/CXXFLAGS and LDFLAGS for the “performance” configuration.
- Keep a simpler non-LTO debug configuration.
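A minimal sketch of an LTO-enabled performance configuration (file names are placeholders):

```sh
# Compile each translation unit and link, all with -flto.
CFLAGS="-O3 -flto"; LDFLAGS="-flto"
gcc $CFLAGS -c solver.c
gcc $CFLAGS -c main.c
gcc $LDFLAGS -o app main.o solver.o   # cross-file inlining happens at this link step
# For static archives, use the LTO-aware wrappers: gcc-ar rcs libsolver.a solver.o
```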
Profile-Guided Optimization (PGO)
PGO uses runtime profiling information to guide optimizations:
- Instrumentation build
  - Compile with flags to collect profiling data (e.g., -fprofile-generate).
- Training runs
  - Run representative inputs to exercise typical code paths.
- Optimized build
  - Recompile with -fprofile-use to let the compiler optimize based on observed behavior.
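Concretely, the three phases might look like this with GCC (app.c and the training input are placeholder names):

```sh
gcc -O3 -fprofile-generate -o app_instr app.c   # 1. instrumented build
./app_instr representative_input.dat            # 2. training run writes *.gcda profile data
gcc -O3 -fprofile-use -o app_opt app.c          # 3. rebuild guided by the collected profile
```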
Pros:
- Better branch prediction layout.
- More informed inlining decisions.
- Can significantly improve performance for large applications.
Cons:
- More complicated build pipeline.
- Requires stable, representative workloads.
- Not always compatible with simple module-based deployment models on clusters.
In HPC, PGO is mainly used for:
- Large simulation codes or libraries with long lifetimes.
- Performance-critical kernels that justify the extra complexity.
Build-type separation: debug vs optimized
An HPC workflow almost always maintains at least two build types:
- Debug build
  - Low optimization (-O0 or -O1).
  - Full debug info (-g).
  - Extra checks/assertions enabled.
  - Easier to debug, but slower.
- Optimized (release) build
  - High optimization (-O2/-O3, optionally -Ofast).
  - May still include minimal debug symbols (-g) for backtraces.
  - Assertions often disabled (-DNDEBUG in C/C++).
  - Tuned for the target architecture.
In Make/CMake or other build systems:
- Use separate build directories: build-debug/, build-release/, build-perf/, etc.
- Use variables or build types:
  - Make: set CFLAGS, CXXFLAGS, FFLAGS, LDFLAGS differently per target.
  - CMake: -DCMAKE_BUILD_TYPE=Release vs. Debug vs. RelWithDebInfo.
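For example, with CMake (the -S/-B options require CMake 3.13 or newer):

```sh
# Maintain separate debug and release trees from the same source directory.
cmake -S . -B build-debug   -DCMAKE_BUILD_TYPE=Debug
cmake -S . -B build-release -DCMAKE_BUILD_TYPE=Release
cmake --build build-release -j
```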
This separation lets you:
- Debug logic errors without optimization interference.
- Run large-scale jobs with maximum performance.
Libraries and linking in optimized builds
Optimized builds should also link against optimized libraries:
- Use vendor- or cluster-provided math and linear algebra libraries (e.g., MKL, OpenBLAS, vendor BLAS/LAPACK, FFT libraries).
- Static vs. dynamic linking:
  - Static linking can sometimes give small performance gains or portability advantages, but may increase binary size and complicate deployment.
- Ensure consistency of compiler and library:
  - Use libraries built with a compatible compiler and ABI.
  - Prefer modules/toolchains that provide a coherent stack (compiler + MPI + BLAS + FFT).
Typical HPC practice:
- Load the relevant modules (e.g., module load gcc/… openmpi/… openblas/…).
- Use environment variables or the provided wrappers (e.g., mpicc, mpif90), which already set the correct optimized link flags.
- Avoid mixing unrelated compilers and libraries in a single optimized build.
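A typical session might look like the sketch below (module names and versions are site-specific placeholders):

```sh
# Load a coherent toolchain, then build through the MPI wrapper.
module load gcc openmpi openblas
mpicc -O2 -c solver.c
mpicc -O2 -o solver solver.o -lopenblas   # the wrapper adds the MPI include/link flags
```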
Reproducibility and configuration recording
Optimized builds can be complex; you need to record how they were produced:
- Save the exact compiler version and module list used.
- Record key compilation flags:
- Optimization level.
- Architecture-specific and vectorization flags.
- Floating-point and fast-math flags.
- LTO / PGO usage.
- Prefer build systems that can generate a summary configuration (e.g., CMakeCache.txt, a custom config.log, or a script that prints build info at runtime).
For production HPC runs:
- Keep a copy of the build configuration alongside results.
- Consider embedding version and build flags into the binary (via #define/__DATE__, or build-system-generated headers).
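One lightweight approach is to pass the information in on the compile line; a sketch with hypothetical macro names (gcc -dumpfullversion requires GCC 7+):

```sh
# Inject build provenance as C string macros at compile time.
gcc -O3 -march=native \
    -DBUILD_COMPILER="\"gcc-$(gcc -dumpfullversion)\"" \
    -DBUILD_FLAGS="\"-O3 -march=native\"" \
    -c main.c
# main.c can then print BUILD_COMPILER and BUILD_FLAGS in a --version or startup banner.
```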
Practical patterns for optimized builds on clusters
Common patterns you are likely to see or use:
- Simple release build with generic optimizations
  - -O2 -g -march=x86-64 -mtune=native
  - Balanced performance and portability.
- Aggressive node-specific build
  - -O3 -Ofast -march=native -funroll-loops -flto
  - Only if you are sure binaries stay on compatible nodes and results are validated.
- Hybrid debug/perf builds
  - -O2 -g (or RelWithDebInfo in CMake)
  - Reasonable performance while remaining debuggable and profilable.
- Highly tuned library builds
  - Used mainly by library maintainers or system admins.
  - Multiple variants per architecture, with PGO, LTO, and tailored math flags.
As you progress, you will learn which combination makes sense for your application and cluster, and how to evolve from a simple -O2 build to more sophisticated optimized builds guided by measurement and profiling.