Why optimization flags matter in HPC
Compiler optimization flags control how aggressively the compiler transforms your code to run faster or use fewer resources. In HPC, these flags can easily make the difference between a program that runs in 10 hours and one that runs in 4—without changing a single line of source code.
The trade‑off: the more aggressive the optimization, the longer compilation takes, and the harder it can be to debug or guarantee strict language semantics (especially with floating‑point math).
This chapter focuses on:
- Typical optimization levels (e.g. `-O0`…`-O3`, `-Ofast`)
- CPU‑specific tuning (`-march`, `-mtune`, vendor equivalents)
- Floating‑point–related flags (fast‑math, fused operations)
- Vectorization‑related flags
- Debug–vs–performance combinations used in HPC practice
Examples will mostly use GCC‑style syntax; Intel and LLVM/Clang variants are pointed out where important.
Basic optimization levels
Most compilers group many low‑level transformations into a few optimization “levels”.
`-O0`: no optimization
- Goal: fastest compilation, easiest debugging.
- Compiler does minimal transformation.
- Characteristics:
- Code is usually much slower.
- Line‑by‑line debugging matches source best.
- Usage:
- Early development.
- Debug builds.
- GCC/Clang/Intel: `-O0` means “turn off almost all optimizations”.
`-O1`: basic optimization
- Goal: some speedup, still relatively quick to compile and debug.
- Commonly enabled optimizations:
- Simple inlining.
- Local common‑subexpression elimination.
- Basic loop optimizations.
- Typical usage:
- Development when you want code that’s closer to production behavior but still fairly debuggable.
`-O2`: general‑purpose high optimization
- This is the level most HPC projects choose by default for production builds.
- Compiler turns on a wide range of safe optimizations:
- Loop transformations, dead‑code elimination, inlining (more than at `-O1`), register‑allocation improvements, etc.
- Goals:
- Good runtime performance without being too aggressive about risky assumptions.
- Reasonable compile times.
- Typical usage:
- Production builds when you care about correctness and portability.
- Baseline for performance tuning.
`-O3`: more aggressive optimization
- Adds more aggressive transformations aimed at performance, especially for compute‑intensive code:
- More inlining, unrolling, vectorization, and loop transformations.
- Pros:
- Often significantly faster on numeric kernels and tight loops.
- Cons:
- Code size grows (more cache pressure).
- Can sometimes hurt performance for memory‑bound codes or very large applications.
- May make debugging more difficult.
- Usage in HPC:
- Common for numerical libraries and performance‑critical codes.
- Often combined with architecture‑specific flags.
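As a rough way to check whether `-O3` pays off for a given code, you can build the same source at both levels and time a representative run. A minimal sketch, assuming a single hypothetical source file `solver.c` and input file `input.dat`:

```bash
# Build the same code at -O2 and -O3 (file names are placeholders).
gcc -O2 -o solver_o2 solver.c -lm
gcc -O3 -o solver_o3 solver.c -lm

# Time both on a realistic workload; repeat runs before drawing conclusions.
time ./solver_o2 input.dat
time ./solver_o3 input.dat
```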
`-Ofast`: speed over strict standards
Name differs slightly by compiler, but the idea is similar:
- GCC/Clang: `-Ofast` = `-O3` plus options that may break strict standards compliance; most notably, it enables `-ffast-math` and related flags.
- Intel (e.g. `-fast` or `/fast`): a bundle of aggressive flags: high optimization, fast math, architecture targeting, etc.
Consequences:
- Can reorder floating‑point operations more freely.
- May ignore some corner‑case behaviors (like NaNs, infinities, signed zeros).
- Can change numerical results slightly; occasionally changes convergence behavior in iterative solvers.
Typical HPC usage:
- Consider for final performance runs on well‑tested codes where small numerical differences are acceptable.
- Avoid for:
- Debugging.
- Validation runs where bitwise reproducibility is required.
- Codes that rely on precise IEEE 754 behavior or strict ordering of operations.
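With GCC you can inspect what `-Ofast` adds on top of `-O3` by dumping the enabled optimization options at each level and diffing them; the exact option list varies by compiler version:

```bash
# List the optimization options GCC enables at each level, then compare.
# The diff typically shows the relaxed math options (-fassociative-math, etc.)
# that -Ofast turns on via -ffast-math.
gcc -O3    -Q --help=optimizers > o3.txt
gcc -Ofast -Q --help=optimizers > ofast.txt
diff o3.txt ofast.txt
```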
CPU architecture–specific flags
Compilers can generate code tuned for specific CPUs, exploiting newer instructions such as AVX2 or AVX‑512.
`-march` (machine architecture)
- Tells the compiler what CPU it is allowed to target.
- Examples (GCC/Clang):
- `-march=native`: detects the current CPU and uses all features it supports.
- `-march=skylake` or `-march=skylake-avx512`.
- `-march=znver2` (AMD Zen 2), etc.
Effects:
- Enables use of instruction sets (SSE, AVX, AVX2, AVX‑512, etc.).
- May change ABI in subtle ways; binaries may not run on older CPUs.
HPC implication:
- On shared clusters, do not blindly use `-march=native` when building on a login node; compute nodes might differ.
- Prefer an architecture that matches the oldest CPU type you must support, or build separate binaries per partition.
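One common pattern is to build one binary per partition and let the job script select the matching one. A sketch with illustrative targets and placeholder file names:

```bash
# Separate builds for (hypothetical) Skylake and Zen 2 partitions.
gcc -O3 -march=skylake-avx512 -o myapp.skx  myapp.c -lm
gcc -O3 -march=znver2         -o myapp.zen2 myapp.c -lm
```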
`-mtune` (tuning, but keep compatibility)
- Tunes scheduling and instruction selection for a target CPU, but keeps the instruction set compatible with a more generic baseline.
- Example: `-march=x86-64 -mtune=skylake`
- Good for: binaries that must run on a wide range of hardware but are tuned for the most common architecture.
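A sketch of this portable-but-tuned approach, again with placeholder file names:

```bash
# Runs on any x86-64 CPU, but instruction scheduling is tuned for Skylake.
gcc -O3 -march=x86-64 -mtune=skylake -o myapp myapp.c -lm
```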
Vendor‑specific examples
- Intel oneAPI / classic Intel:
- `-xHost`: like `-march=native`; use the highest instruction set available on the build machine.
- `-xCORE-AVX2`, `-xCORE-AVX512`, etc.: target specific microarchitectures.
- `-axCORE-AVX2`: generate multi‑versioned code that runs broadly but takes optimized paths for specific CPUs at runtime.
- AMD AOCC: similar to Clang/GCC, with AMD‑tuned `-march=znverX` options.
HPC practice:
- For production modules, clusters often provide multiple builds of key libraries:
- “Generic” (e.g. baseline x86‑64).
- AVX2‑optimized.
- AVX‑512‑optimized.
- For your own codes, consult the cluster documentation: it often recommends a safe `-march` for each partition.
Floating‑point–related flags
Floating‑point math is central in HPC and is sensitive to compiler assumptions. Many aggressive optimizations are controlled by floating‑point flags.
Fast‑math bundles
Common idea: relax IEEE 754 and language constraints to allow more transformations.
- GCC: `-ffast-math`, a bundle that enables multiple flags (`-fno-math-errno`, `-funsafe-math-optimizations`, etc.).
- Clang: similar semantics for `-ffast-math`.
- Intel: often enabled with `-fp-model fast`, `-fp-model fast=2`, or via umbrella flags like `-Ofast` or `-fast`.
Typical transformations:
- Reorder operations assuming associativity: $(a + b) + c = a + (b + c)$ (not strictly true in floating point).
- Use reciprocal and reciprocal square root approximations (`1/x` and `1/sqrt(x)`) with refinement.
- Assume no NaNs, no infinities, no subnormal numbers.
- Fuse multiply‑add into FMA instructions where available.
Implications:
- Faster and more vectorizable code.
- Numerical results may differ (at ULP level or more).
- May affect convergence or stability for some algorithms.
HPC guideline:
- Use for performance testing on well‑understood algorithms.
- Compare against a more conservative build to ensure differences are acceptable.
- Avoid when you require bit‑reproducible runs (e.g. for regression testing or some scientific workflows).
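Following that guideline might look like the sketch below: build a conservative reference and a fast‑math variant of the same (hypothetical) program and compare their outputs before adopting the faster build:

```bash
# Conservative reference build vs. fast-math build (file names are placeholders).
gcc -O2             -o myapp_ref  myapp.c -lm
gcc -O3 -ffast-math -o myapp_fast myapp.c -lm

./myapp_ref  > ref.out
./myapp_fast > fast.out

# Any differences must be judged numerically, not just textually.
diff ref.out fast.out || echo "outputs differ; review against your accuracy requirements"
```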
Strict vs relaxed models (Intel, others)
Intel style:
- `-fp-model strict`: conform strictly to IEEE and language rules; fewest transformations.
- `-fp-model precise`: a good balance of correctness and performance for many codes.
- `-fp-model fast`, `-fp-model fast=2`: increasingly aggressive assumptions, similar to fast‑math.
HPC usage:
- Often: development/validation with `strict` or `precise`; production with `precise` or a controlled level of `fast`.
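With the Intel oneAPI compiler this often becomes two build configurations, sketched below with the `icx` driver; exact flag spellings differ between classic `icc` (`-fp-model precise`) and `icx` (`-fp-model=precise`), so check your compiler's documentation:

```bash
# Validation build: stricter floating-point semantics, moderate optimization.
icx -O2 -fp-model=precise -o myapp_precise myapp.c

# Production candidate: relaxed floating-point model, validated against the build above.
icx -O3 -fp-model=fast -o myapp_fast myapp.c
```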
Fused multiply-add (FMA)
Modern CPUs support an FMA instruction computing:
$$
a \times b + c
$$
in a single step with one rounding. Benefits:
- Higher throughput.
- Often more precise than separate multiply + add.
Compiler flags:
- GCC/Clang: `-ffp-contract=fast` (or via `-Ofast`/`-ffast-math`) encourages fusion.
- Intel: FMA is often enabled automatically when targeting FMA‑capable architectures.
Caveat:
- Results may differ slightly from non‑FMA builds; this matters only if you rely on strict reproducibility.
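To check whether contraction matters for your results, you can build with fusion disabled and with it encouraged and compare the two; GCC/Clang‑style flags, placeholder file name:

```bash
# No contraction: a*b + c is computed as a separate multiply and add.
gcc -O3 -march=native -ffp-contract=off  -o kernel_nofma kernel.c -lm

# Allow contraction into FMA instructions where the target supports them.
gcc -O3 -march=native -ffp-contract=fast -o kernel_fma   kernel.c -lm
```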
Vectorization‑related flags
Vectorization is covered conceptually elsewhere; here we focus on flags that affect whether the compiler will vectorize loops and how much it tells you.
Enabling auto‑vectorization
For most modern compilers:
- GCC: auto‑vectorization is typically enabled at `-O3` and partly at `-O2`. Older GCC required `-ftree-vectorize` explicitly; it is now folded into the higher `-O` levels.
- Clang: vectorization at `-O3`, some at `-O2`.
- Intel: `-O2` already vectorizes; `-O3` pushes harder.
You generally don’t need a “turn on vectorization” flag; the optimization level and architecture flags are more important.
Controlling assumptions about aliasing
Compilers are conservative when they think pointers might overlap (“alias”), which can block vectorization.
- GCC: `-fstrict-aliasing` (enabled by default at `-O2` and above).
- Many HPC codes also use C/C++ `restrict` or `__restrict__` to help the compiler.
Be aware: misusing `restrict` or strict-aliasing assumptions can lead to wrong results, not just slow code.
Reporting vectorization decisions
Very important for HPC tuning: ask the compiler what it is doing.
- GCC: `-fopt-info-vec` (and `-fopt-info-vec-missed`) emit vectorization reports: which loops were vectorized, which were not, and why. Use `-fopt-info-optimized` or `-fopt-info-all` for broader optimization reports.
- Clang: `-Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize`
- Intel: `-qopt-report=5 -qopt-report-phase=vec` (or similar) generates a detailed report file about vectorization.
HPC usage:
- Compile with `-O3` (and architecture flags), plus a reporting option, on a key source file.
- Inspect loops that failed to vectorize; adjust code or hints (pragmas) accordingly.
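A minimal sketch of this workflow with GCC: write a small kernel (here a hypothetical saxpy with `restrict`-qualified pointers) and ask for both the successful and the missed vectorization reports; the report wording varies between compiler versions:

```bash
# Tiny example kernel; restrict tells the compiler x and y do not alias.
cat > saxpy.c <<'EOF'
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
EOF

# Report which loops were vectorized and which were missed (and why).
gcc -O3 -march=native -fopt-info-vec -fopt-info-vec-missed -c saxpy.c
```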
Common optimization flag sets in HPC
In practice, you often don’t pick individual flags from scratch. Instead you choose profiles appropriate to the stage of development.
Debug builds (development)
Goals: easy debugging, no aggressive reordering, compile fast.
Typical GCC/Clang: `-O0 -g`
- Optionally add:
- `-fno-omit-frame-pointer` for better backtraces.
- `-Wall -Wextra` (warning flags, not optimizations, but often included).
For threaded/MPI + debug, you may also add sanitizer flags (which will be covered elsewhere).
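A typical debug invocation, with a placeholder source file, might look like:

```bash
# Debug build: no optimization, full debug info, frame pointers kept, warnings on.
gcc -O0 -g -fno-omit-frame-pointer -Wall -Wextra -o myapp_debug myapp.c -lm
```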
“Check” builds (debuggable but somewhat optimized)
Goals: resemble production performance behavior but still debuggable.
Example: `-O1 -g` or `-O2 -g`
- Sometimes:
- Avoid `-ffast-math` and `-Ofast`.
- May disable inlining of very large functions if debugging is difficult.
These builds are useful for diagnosing performance bugs and checking correctness on medium‑sized test cases.
Production builds (performance)
Goals: maximum speed with acceptable numerical behavior.
Typical baseline (GCC/Clang):
- CPU‑agnostic but optimized: `-O3 -g` or `-O3`
- Possibly `-march=x86-64 -mtune=native` or cluster‑recommended tuning.
- CPU‑specific: `-O3 -march=skylake-avx512 -mtune=skylake-avx512`
- Or `-Ofast -march=...` if the numerical assumptions are acceptable.
Intel example:
- `-O3 -xCORE-AVX2 -qopenmp` (if using OpenMP)
- For more aggressive optimization: add `-Ofast` or `-fp-model fast=2` after validating results.
HPC recommendation:
- Start with `-O2` or `-O3` and architecture flags.
- Measure performance.
- Then test `-Ofast`/fast‑math variants carefully to see whether the numerical differences and speedup are acceptable.
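Put together, a plausible GCC progression could look like the sketch below; the source file and the chosen `-march` are placeholders for your code and cluster:

```bash
# 1. Portable baseline: correctness reference and performance baseline.
gcc -O2 -march=x86-64 -g -o myapp_base myapp.c -lm

# 2. Architecture-tuned production candidate.
gcc -O3 -march=skylake-avx512 -g -o myapp_tuned myapp.c -lm

# 3. Aggressive variant; adopt only after validating its numerical output.
gcc -Ofast -march=skylake-avx512 -o myapp_ofast myapp.c -lm
```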
Interactions with debugging and profiling tools
Optimization flags can affect the usefulness of debugging/profiling output:
- Optimized code may:
- Reorder instructions.
- Inline functions (stack traces look different).
- Remove variables completely (you can’t inspect them).
- Profilers may still work well (especially sampling‑based tools), but line attribution can be less clear.
Typical approach in HPC:
- Debug/validation: `-O0 -g` or `-O1 -g`; use a debugger plus address/undefined‑behavior sanitizers.
- Profiling: `-O2 -g` or `-O3 -g`; keeps debug info for mapping hotspots to source while retaining performance characteristics.
- Final runs: may drop `-g` for slightly smaller binaries and faster loading, though many leave it on if disk space is not critical.
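For sampling profilers such as Linux `perf`, a common combination is an optimized build that keeps debug info and frame pointers; a sketch with placeholder names:

```bash
# Optimized build that still maps samples back to source reasonably well.
gcc -O2 -g -fno-omit-frame-pointer -o myapp myapp.c -lm

# Sample a run and browse hotspots (requires Linux perf).
perf record -g ./myapp input.dat
perf report
```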
Practical tips for choosing flags on a cluster
- Read the cluster documentation
- Many centers prescribe or recommend specific flag sets for their CPUs and compilers. These are often a good default.
- Be explicit and consistent
- Put your chosen flags in a build system (`Makefile`, `CMakeLists.txt`); a minimal CMake sketch follows this list.
- Separate debug vs. release configurations.
- Benchmark systematically
- Compare `-O2` vs `-O3` vs `-Ofast` on realistic workloads.
- Measure, don’t guess.
- Validate numerical results
- When enabling fast‑math or changing optimization levels, compare against a trusted baseline.
- Decide what level of difference is acceptable for your application.
- Beware of portability
- Architecture‑specific flags may produce binaries that fail on older nodes.
- If your job can run on multiple partitions, build for the lowest common denominator or create multiple optimized builds.
- Use reports
- Combine optimization flags with compiler reports to understand what’s actually happening (vectorization, inlining, etc.), which is especially valuable for HPC optimization work.
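As one way to keep flags explicit and configurations separate, CMake's standard cache variables can carry them; the flag values below are examples, not recommendations for any particular cluster:

```bash
# Release configuration with explicit optimization flags (example values).
cmake -S . -B build-release \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_FLAGS_RELEASE="-O3 -march=x86-64-v3"
cmake --build build-release

# Separate debug configuration.
cmake -S . -B build-debug -DCMAKE_BUILD_TYPE=Debug
cmake --build build-debug
```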
Summary
Compiler optimization flags are a powerful, low‑effort way to improve HPC application performance. The key decisions are:
- Which optimization level (`-O0`…`-O3`, `-Ofast`) fits your stage (debugging vs. production)?
- How aggressively to target specific CPUs (`-march`, `-mtune`, vendor‑specific `-x`/`-ax`)?
- How much to relax floating‑point strictness (`-ffast-math`, `-fp-model fast`)?
- How to observe and trust the compiler’s optimizations (vectorization reports, benchmarking, validation)?
Used thoughtfully, these flags can yield large speedups with minimal code changes, while preserving the correctness and reproducibility standards required in scientific and engineering HPC workloads.