What “optimized build” means in HPC
An optimized build is a version of your program compiled with flags and settings that aim for maximum performance on a target machine, often at the cost of:
- Longer compile times
- Larger binaries
- Less helpful debugging information
- Reduced portability (tuned for a specific CPU/GPU)
In HPC, “optimized” usually means:
- Enabling high levels of compiler optimization (-O2, -O3, etc.)
- Turning on vectorization and architecture-specific tuning
- Choosing math library and floating‑point trade‑offs consciously
- Using link‑time optimization and profile‑guided optimization when useful
This chapter focuses on how to build such binaries, not on all the underlying performance theory (covered in performance-related chapters).
Typical optimization levels
Most compilers follow a similar hierarchy; details differ by compiler, but conceptually:
- -O0: No optimization. Direct translation of the code; fastest compilation. Use for debugging only.
- -O1: Light optimizations; generally safe and still relatively quick to compile.
- -O2: The “workhorse” optimization level. Most HPC codes use at least this. Good balance of runtime speed and compile time.
- -O3: Aggressive optimizations that may significantly restructure the code. Often faster, but not always; it may increase memory usage or cause performance regressions in some cases.
- -Ofast (compiler-specific): Like -O3 plus unsafe math/standards relaxations (e.g., ignoring strict IEEE/ISO semantics). May break strict numerical reproducibility or correctness in edge cases.
On GCC/Clang, for example, -O2 is usually a safe default for production; -O3 and -Ofast should be tested carefully against both correctness and performance.
On Intel and other vendor compilers the names differ, but the ideas are similar (e.g., Intel's -O2, -O3, and -fast).
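The practical impact of a level change is easy to measure yourself. A minimal sketch, assuming a single-file test program (miniapp.c and its input file are placeholder names):

```sh
# Build the same source at two optimization levels and compare timings.
gcc -O2 -o miniapp_O2 miniapp.c -lm
gcc -O3 -o miniapp_O3 miniapp.c -lm
time ./miniapp_O2 bench_input.dat
time ./miniapp_O3 bench_input.dat   # -O3 is not guaranteed to win; measure, don't assume
```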
Architecture-specific tuning
HPC systems often have specific CPU microarchitectures. Compilers can generate code tuned for them.
Typical GCC/Clang flags:
- -march=native: Generate instructions for the CPU you are compiling on. Great for code built directly on the target nodes, but may break portability to older CPUs.
- -mtune=native: Optimize for performance on the local CPU while still generating a more compatible instruction set (depends on architecture).
On clusters, you often need to:
- Know the node microarchitecture (e.g., Intel Skylake, AMD Zen 3).
- Use something like -march=skylake-avx512 or -march=znver3 (if allowed).
Important considerations:
- Portability vs. performance: a binary built with -march=native on a new CPU might not run on older nodes.
- Module systems: some toolchains provided by modules are already tuned for the cluster CPU, reducing your need to specify -march explicitly.
As a rule, cluster-wide production codes either:
- Are built with conservative but still tuned flags that work on all nodes, or
- Are built per-partition/architecture with different flags and stored separately.
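To find out what -march=native actually resolves to on a given node, you can ask GCC directly. A small sketch (the grep patterns are illustrative):

```sh
# Identify the CPU, then see which concrete -march GCC infers for it.
lscpu | grep 'Model name'
gcc -march=native -Q --help=target | grep -- '-march='
```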
Vectorization and SIMD flags
Vectorization is crucial in HPC. Build-time options often control how aggressively the compiler vectorizes loops and which instruction sets it may use.
Common GCC/Clang examples:
- -ftree-vectorize (usually included in -O3)
- -fopt-info-vec or -fopt-info-vec-missed to get reports on (missed) vectorization
- -mavx2, -mavx512f, etc., to allow specific SIMD instruction sets
Vendor compilers expose similar controls (e.g., Intel -xHost, -xCORE-AVX2, -xCORE-AVX512).
Practices for optimized builds:
- Enable vectorization (typically via -O2/-O3).
- Check compiler reports to ensure key loops are vectorized.
- Use architecture flags to allow advanced SIMD if your hardware and deployment model permit it.
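A quick way to audit vectorization with GCC, sketched below (kernels.c is a placeholder for a hot source file):

```sh
# Report which loops GCC vectorized, and which it could not (and why).
gcc -O3 -march=native -fopt-info-vec        -c kernels.c
gcc -O3 -march=native -fopt-info-vec-missed -c kernels.c
```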
Floating-point behavior and “fast math”
Optimized builds often adjust floating-point settings:
Examples (GCC/Clang):
- -ffast-math: Enables multiple unsafe floating-point optimizations (e.g., reassociation, assuming no NaNs/infinities).
- -fno-math-errno, -funsafe-math-optimizations, -fno-trapping-math: Finer-grained controls, often implicitly enabled by -ffast-math.
Intel and other compilers have similar flags, e.g., -fimf-precision=low or model-specific fast-math options.
Trade-offs:
- Pros:
  - Often substantial speedup in math-heavy code (linear algebra, PDE solvers, etc.).
- Cons:
  - Results may differ from strict IEEE behavior.
  - Runs might not be bitwise reproducible across compilers or even optimization levels.
  - May expose numerical instabilities in poorly conditioned algorithms.
In an optimized HPC build:
- Decide ahead of time if fast math is acceptable for your domain.
- Validate scientific results after enabling these options.
- Document which math/precision flags you used (important for reproducibility).
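A simple validation sketch, assuming a small numeric test program (sum.c is a placeholder):

```sh
# Build with and without fast math, then compare the outputs.
gcc -O2             -o sum_strict sum.c -lm
gcc -O2 -ffast-math -o sum_fast   sum.c -lm
./sum_strict > strict.out
./sum_fast   > fast.out
diff strict.out fast.out   # any difference means fast math altered the numerics
```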
Link-time optimization (LTO)
Link-time optimization allows the compiler to optimize across translation units (i.e., across separate .c/.cpp files) at the link stage.
Typical GCC/Clang flags:
- Compile: -flto
- Link: -flto (and use a compatible linker)
Benefits:
- Inlining and constant propagation across files.
- Potentially smaller and faster binaries.
Costs:
- Longer build times.
- More complex toolchain requirements (linker, ar, ranlib must support LTO).
- Some profilers or debuggers may be less friendly to heavily LTO-optimized binaries.
Usage pattern in an optimized build:
- Add -flto to CFLAGS/CXXFLAGS and LDFLAGS for the “performance” configuration.
- Keep a simpler non-LTO debug configuration.
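A minimal sketch of an LTO-enabled performance configuration (file names are placeholders):

```sh
# Compile each translation unit and link, all with -flto.
CFLAGS="-O3 -flto"; LDFLAGS="-flto"
gcc $CFLAGS -c solver.c
gcc $CFLAGS -c main.c
gcc $LDFLAGS -o app main.o solver.o   # cross-file inlining happens at this link step
# For static archives, use the LTO-aware wrappers: gcc-ar rcs libsolver.a solver.o
```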
Profile-Guided Optimization (PGO)
PGO uses runtime profiling information to guide optimizations:
- Instrumentation build
  - Compile with flags to collect profiling data (e.g., -fprofile-generate).
- Training runs
  - Run representative inputs to exercise typical code paths.
- Optimized build
  - Recompile with -fprofile-use to let the compiler optimize based on observed behavior.
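Concretely, the three phases might look like this with GCC (app.c and the training input are placeholder names):

```sh
gcc -O3 -fprofile-generate -o app_instr app.c   # 1. instrumented build
./app_instr representative_input.dat            # 2. training run writes *.gcda profile data
gcc -O3 -fprofile-use -o app_opt app.c          # 3. rebuild guided by the collected profile
```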
Pros:
- Better branch prediction layout.
- More informed inlining decisions.
- Can significantly improve performance for large applications.
Cons:
- More complicated build pipeline.
- Requires stable, representative workloads.
- Not always compatible with simple module-based deployment models on clusters.
In HPC, PGO is mainly used for:
- Large simulation codes or libraries with long lifetimes.
- Performance-critical kernels that justify the extra complexity.
Build-type separation: debug vs optimized
An HPC workflow almost always maintains at least two build types:
- Debug build
  - Low optimization (-O0 or -O1).
  - Full debug info (-g).
  - Extra checks/assertions enabled.
  - Easier to debug, but slower.
- Optimized (release) build
  - High optimization (-O2/-O3, optionally -Ofast).
  - May still include minimal debug symbols (-g) for backtraces.
  - Assertions often disabled (-DNDEBUG in C/C++).
  - Tuned for the target architecture.
In Make/CMake or other build systems:
- Use separate build directories: build-debug/, build-release/, build-perf/, etc.
- Use variables or build types:
  - Make: set CFLAGS, CXXFLAGS, FFLAGS, LDFLAGS differently per target.
  - CMake: -DCMAKE_BUILD_TYPE=Release vs. Debug vs. RelWithDebInfo.
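For example, with CMake (the -S/-B options require CMake 3.13 or newer):

```sh
# Maintain separate debug and release trees from the same source directory.
cmake -S . -B build-debug   -DCMAKE_BUILD_TYPE=Debug
cmake -S . -B build-release -DCMAKE_BUILD_TYPE=Release
cmake --build build-release -j
```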
This separation lets you:
- Debug logic errors without optimization interference.
- Run large-scale jobs with maximum performance.
Libraries and linking in optimized builds
Optimized builds should also link against optimized libraries:
- Use vendor- or cluster-provided math and linear algebra libraries (e.g., MKL, OpenBLAS, vendor BLAS/LAPACK, FFT libraries).
- Static vs. dynamic linking:
  - Static linking can sometimes give small performance gains or portability advantages, but may increase binary size and complicate deployment.
- Ensure consistency of compiler and library:
  - Use libraries built with a compatible compiler and ABI.
  - Prefer modules/toolchains that provide a coherent stack (compiler + MPI + BLAS + FFT).
Typical HPC practice:
- Load the relevant modules (e.g., module load gcc/… openmpi/… openblas/…).
- Use environment variables or the provided wrappers (e.g., mpicc, mpif90), which already set the correct optimized link flags.
- Avoid mixing unrelated compilers and libraries in a single optimized build.
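A typical session might look like the sketch below (module names and versions are site-specific placeholders):

```sh
# Load a coherent toolchain, then build through the MPI wrapper.
module load gcc openmpi openblas
mpicc -O2 -c solver.c
mpicc -O2 -o solver solver.o -lopenblas   # the wrapper adds the MPI include/link flags
```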
Reproducibility and configuration recording
Optimized builds can be complex; you need to record how they were produced:
- Save the exact compiler version and module list used.
- Record key compilation flags:
- Optimization level.
- Architecture-specific and vectorization flags.
- Floating-point and fast-math flags.
- LTO / PGO usage.
- Prefer build systems that can generate a summary configuration (e.g., CMakeCache.txt, a custom config.log, or a script that prints build info at runtime).
For production HPC runs:
- Keep a copy of the build configuration alongside results.
- Consider embedding version and build flags into the binary (via #define/__DATE__, or build-system-generated headers).
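One lightweight approach is to pass the information in on the compile line; a sketch with hypothetical macro names (gcc -dumpfullversion requires GCC 7+):

```sh
# Inject build provenance as C string macros at compile time.
gcc -O3 -march=native \
    -DBUILD_COMPILER="\"gcc-$(gcc -dumpfullversion)\"" \
    -DBUILD_FLAGS="\"-O3 -march=native\"" \
    -c main.c
# main.c can then print BUILD_COMPILER and BUILD_FLAGS in a --version or startup banner.
```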
Practical patterns for optimized builds on clusters
Common patterns you are likely to see or use:
- Simple release build with generic optimizations
  - -O2 -g -march=x86-64 -mtune=native
  - Balanced performance and portability.
- Aggressive node-specific build
  - -O3 -Ofast -march=native -funroll-loops -flto
  - Only if you are sure binaries stay on compatible nodes and results are validated.
- Hybrid debug/perf builds
  - -O2 -g (or RelWithDebInfo in CMake)
  - Reasonable performance while remaining debuggable and profilable.
- Highly tuned library builds
  - Used mainly by library maintainers or system admins.
  - Multiple variants per architecture, with PGO, LTO, and tailored math flags.
As you progress, you will learn which combination makes sense for your application and cluster, and how to evolve from a simple -O2 build to more sophisticated optimized builds guided by measurement and profiling.