What is an HPC Software Stack?
In an HPC context, a software stack is the ordered collection of software layers that turn bare metal (or cloud instances) into a usable scientific computing environment.
You can think of it as several layers, from bottom to top:
- Hardware
  CPUs, GPUs, interconnects, storage.
- System software
  OS, kernel, device drivers, low‑level libraries (e.g. glibc).
- Core HPC infrastructure
  Resource manager / job scheduler, parallel filesystems, monitoring.
- Programming toolchain
  Compilers, MPI libraries, math libraries, build tools, debuggers, profilers.
- Domain and application layer
  Simulation codes, analysis tools, Python/R environments, domain‑specific frameworks.
A cluster’s software stack is how these layers are selected, version‑pinned, built, and made available consistently to all users and nodes.
This chapter focuses on how these stacks are organized and what that means for your day‑to‑day work and reproducibility.
Typical Layers in an HPC Software Stack
While implementations differ between sites, most HPC software stacks share similar structural elements.
System and Core Libraries
These are usually provided and managed by the system administrators:
- Linux distribution packages
  Base system tools, compilers (often older), Python, shells, and system libraries via the distro’s package manager (apt, yum, dnf, zypper, etc.).
- C runtime and basic system libraries
  glibc, libm, low‑level networking and threading libraries (libpthread), etc.
- Kernel‑related interfaces
  NUMA control, hugepages support, GPU drivers, InfiniBand drivers and user‑space libraries.
As a user, you rarely modify this layer, but it constrains which compiler and library versions can be built on top of it.
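If you are curious what this layer looks like on a given node, a few standard commands show the versions you are implicitly building against (a minimal sketch; the GPU query only makes sense on GPU nodes):
ldd --version | head -n 1        # glibc version
uname -r                         # kernel version
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # GPU driver, GPU nodes only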
Programming Toolchains
A toolchain is a coherent set of compiler, MPI, and basic support libraries that are tested to work together.
Typical components:
- Compiler family and version
gcc/13.2.0intel-oneapi-compilers/2025.0nvhpc/24.7clang/18.1- MPI implementation and version
openmpi/5.0.4mpich/4.2.2- Vendor MPI (e.g.
intel-mpi,cray-mpich) - Core numerical and communication libraries
BLAS, LAPACK, FFT, vendor-optimized math libraries (e.g. Intel MKL), and low‑level fabrics (Infiniband verbs, UCX).
Sites often define named toolchains, for example:
- foss/2024b – “Free and Open Source Software” stack: GCC + OpenMPI + OpenBLAS + FFTW, etc.
- intel/2024a – Intel compilers + Intel MPI + MKL.
These logical groupings simplify loading consistent modules and prevent incompatible combinations.
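As a rough illustration, loading one named toolchain pulls in its whole component set at once (a sketch using the foss naming convention; your site’s names may differ):
module purge                 # start from a clean environment
module load foss/2024b       # one logical toolchain...
module list                  # ...which now shows GCC, OpenMPI, OpenBLAS, FFTW, etc.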
Numerical and Scientific Libraries
Built on top of specific toolchains, these are the math and domain libraries you link to or import:
- Linear algebra and solvers
  BLAS/LAPACK/ScaLAPACK, PETSc, Trilinos, MUMPS, SuperLU, hypre.
- FFT and spectral libraries
  FFTW, vendor FFTs (e.g. MKL FFT, cuFFT).
- Mesh, IO, and data frameworks
  HDF5, NetCDF, ADIOS, parallel IO wrappers, mesh and geometry libraries.
Because these libraries often need to be compiled for each compiler/MPI combination, clustering them around toolchains reduces complexity. You might see multiple builds, such as:
- petsc/3.20-gcc-13.2-openmpi-5.0
- petsc/3.20-intel-2024-mpi
The name encodes the underlying stack, which is crucial for reproducibility.
Languages, Runtimes, and Environments
Beyond C/C++/Fortran, stacks provide higher‑level environments:
- Python stacks
  - System Python (from the OS) – often avoided for scientific work.
  - HPC Python environments (e.g. python/3.12-gcc-13.2, anaconda/2024.06) with NumPy, SciPy, mpi4py, Jupyter, etc.
    These are usually compiled against the cluster’s BLAS/MPI libraries.
- R environments
  R/4.4-gcc-13.2 pre‑built with many CRAN/Bioconductor packages using cluster math libraries.
- Java, Julia, and others when needed, possibly integrated with MPI or GPU libraries.
Cluster‑provided environments help avoid conflicts between user‑installed packages and underlying system libraries.
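A common pattern, sketched below with placeholder module names, is to build a small per‑project environment on top of a cluster‑provided Python module so that compiled packages link against the tuned stack underneath:
module load python/3.12-gcc-13.2 openmpi/5.0   # cluster-provided Python + MPI
python -m venv $HOME/envs/myproject            # lightweight per-project environment
source $HOME/envs/myproject/bin/activate
pip install numpy scipy mpi4py                 # mpi4py compiles against the loaded MPI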
Domain-Specific Stacks
On top of all this, many sites organize thematic or domain stacks:
- Computational chemistry / materials
  VASP, Quantum ESPRESSO, CP2K, LAMMPS, GROMACS, NAMD.
- CFD and engineering
  OpenFOAM, SU2, Code_Saturne, commercial CFD solvers.
- Climate, weather, and geoscience
  WRF, CESM, NEMO, ICON, Met Office / ECMWF tools.
These packages depend on the underlying compilers, MPI, and math libraries, and are built to match the system’s architecture and performance priorities.
How Software Stacks Are Organized on Clusters
Although each site has its own policies, some patterns are very common.
Centralized vs. Layered Stacks
Two broad approaches:
- Monolithic / “one big stack”
  A small number of recommended “official” environments (e.g. intel/2024 and foss/2024), with most software built for those only.
  - Easier for admins to maintain and test.
  - Simpler choices for users.
  - Less flexibility if you need something unusual.
- Layered / modular stacks
  Many combinations of compilers, MPI, and libraries exposed via modules.
  - Very flexible.
  - Can be confusing for beginners.
  - Higher risk of users mixing incompatible modules unless the hierarchy enforces constraints.
In both cases, a module hierarchy (covered elsewhere) is typically used to expose only compatible modules after a core module (e.g. compiler) is loaded.
Hierarchical Module Layout
A common scheme is a three‑level hierarchy:
- Core level
  Only architecture‑independent modules: base compilers, Python, basic tools.
- Compiler level
  Libraries and tools built for a specific compiler become visible only after you load that compiler.
  Example:
  - Load: module load gcc/13.2
  - Now you see: hdf5/1.14-gcc, fftw/3.3.10-gcc.
- MPI level
  MPI‑dependent libraries appear only after an MPI module is loaded.
  Example:
  - Load: module load gcc/13.2
  - Load: module load openmpi/5.0
  - Now you see: petsc/3.20-gcc-openmpi, hdf5/1.14-gcc-openmpi-parallel.
This structure encodes the stack’s dependency graph into the way software is discovered and selected.
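A session on such a system might look like the following sketch (module names are illustrative; use module avail or module spider to explore your own cluster):
module avail petsc                 # nothing visible yet: no compiler/MPI loaded
module load gcc/13.2
module load openmpi/5.0
module avail petsc                 # now lists petsc/3.20-gcc-openmpi
module load petsc/3.20-gcc-openmpi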
Architecture-Specific Substacks
On heterogeneous clusters, multiple architecture variants may exist:
- CPU type (e.g. Intel vs AMD, AVX2 vs AVX‑512).
- GPU vs non‑GPU nodes.
- Different network fabrics.
You may encounter modules or prefixes that indicate architecture, for example:
- foss/2024b-skylake vs foss/2024b-zen4
- cuda/12.4 or gpu subtrees that are only useful on GPU nodes.
For reproducibility, recording the exact architecture‑specific stack is as important as the software versions.
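One way to handle this in portable job scripts is to choose the substack based on the node you actually land on; the sketch below assumes hypothetical skylake and zen4 variants of the same toolchain:
cpu_model=$(lscpu | awk -F: '/Model name/ {print $2}')   # identify the CPU on this node
case "$cpu_model" in
  *EPYC*) module load foss/2024b-zen4 ;;                 # AMD partition
  *)      module load foss/2024b-skylake ;;              # default: Intel partition
esac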
Software Stacks and Reproducibility
Software stacks are central to reproducible HPC workflows. The same source code can behave differently or produce slightly different results depending on:
- Compiler and its optimization defaults.
- MPI implementation and configuration.
- Choice and version of math libraries.
  - BLAS/LAPACK implementation (OpenBLAS vs MKL vs vendor BLAS).
- GPU toolchains and driver versions.
Why Pinning the Stack Matters
Re-running an experiment a year later often fails if you only recall:
- “I used module load gcc and ran my_code.”
Instead, reproducibility requires a more precise capture of the entire stack, e.g.:
- Toolchain: foss/2024b
- Extra modules: hdf5/1.14.3-foss-2024b, petsc/3.20.1-foss-2024b, python/3.12-foss-2024b
- GPU stack: cuda/12.4, cudnn/9.2 (if used)
- Architecture or partition: partition=skylake on SLURM.
Even if module names change, this information guides you (or an admin) to reconstruct an equivalent environment.
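In practice this usually means writing the pinned stack directly into the job script itself, roughly as in this sketch (partition and module names are the examples from above, not site defaults):
#!/bin/bash
#SBATCH --partition=skylake
#SBATCH --nodes=2
module purge
module load foss/2024b
module load hdf5/1.14.3-foss-2024b petsc/3.20.1-foss-2024b python/3.12-foss-2024b
# module load cuda/12.4 cudnn/9.2   # only on GPU partitions
srun ./my_code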
Stack Drift and Software Lifecycles
Over time, clusters:
- Introduce new toolchains (e.g. 2024a, 2024b).
- Deprecate older ones.
- Apply security and bug‑fix updates that may slightly change behavior.
HPC centers often:
- Maintain multiple stack generations in parallel (e.g. “2023 stack” and “2024 stack”).
- Announce end‑of‑life dates for old stacks.
- Encourage users to migrate codes and workflows to newer stacks.
For long‑running projects, it can be wise to:
- Standardize on a specific stack version (e.g. foss/2023b).
- Freeze your workflows and record exact module sets.
- Plan for periodic re‑validation when moving to a new stack generation.
Strategies for Working with Software Stacks
This section focuses on practical ways to interact with the stack so your work remains manageable and reproducible.
Use Recommended / Default Stacks When Possible
Most centers publish “recommended environments”:
- e.g. “Use foss/2024b for general CPU work”
- e.g. “Use nvhpc/24.7 + cuda/12.4 for GPU codes on partition gpu”
Benefits:
- Better support from admins and documentation.
- More testing and fewer surprises.
- Easier to find examples and teaching materials.
Unless you have strong reasons to deviate, start with these defaults.
Record Your Stack Automatically
In your job scripts and analysis notebooks, record the active environment:
- Save the module list:
  module list 2>&1 | tee modules_used.txt
  or
  module -t list > modules_used.txt
- Store environment variables:
  env | sort > environment.txt
- Log compiler/MPI details from your code (e.g. by calling MPI_Get_library_version or printing __VERSION__ macros in C/C++).
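A compact way to do all of this is a short logging block near the top of each job script, along the lines of this sketch (the output file names are arbitrary choices):
module -t list > modules_used.txt 2>&1      # terse list of loaded modules
env | sort > environment.txt                # full environment
mpicc --version  > compiler_version.txt 2>&1 || true   # compiler behind the MPI wrapper, if present
mpirun --version > mpi_version.txt 2>&1 || true         # MPI runtime version, if present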
These logs become part of your reproducibility record alongside input data and code.
Encapsulate Stacks in Setup Scripts
Rather than relying on interactive module load sequences you might forget, create a small setup script:
# myproject_env.sh
module purge
module load foss/2024b
module load hdf5/1.14.3-foss-2024b
module load petsc/3.20.1-foss-2024b
module load python/3.12-foss-2024b

Then:
- In your shell: source myproject_env.sh
- In job scripts: source /path/to/myproject_env.sh
This ensures you, your collaborators, and your future self always use the same stack.
Avoid Mixing Incompatible Substacks
Some common pitfalls:
- Loading libraries from different toolchains simultaneously (e.g. gcc‑built HDF5 with intel‑built NetCDF).
- Mixing MPI‑enabled and serial versions of the same library.
- Combining CPU‑only modules with GPU‑only builds incorrectly.
Typical symptoms:
- Link‑time errors (undefined references, symbol clashes).
- Run‑time crashes or subtle numerical differences.
Mitigation:
- Respect the module hierarchy: load a compiler, then MPI, then only modules that appear under that hierarchy.
- Prefer single, coherent toolchains (e.g. foss or intel) instead of cherry‑picking pieces from different ones.
- If unsure, ask for the supported combination for your application.
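If you suspect a mixed stack, inspecting which shared libraries an executable actually resolves can narrow the problem down; in this sketch the install prefixes are purely illustrative:
ldd ./my_code | grep -Ei 'hdf5|netcdf|mpi|blas'
# Every match should live under a single toolchain tree (e.g. .../foss/2024b/...),
# not a mixture of, say, foss- and intel-based install prefixes.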
When the Stack Does Not Have What You Need
Even with large stacks, you may need:
- A newer version of a library.
- A niche tool.
- A specific build configuration.
Options (from least to most isolated):
- User‑level builds on top of the site stack (see the sketch below)
  - Compile in your home or project directory, using compilers and MPI from the system modules.
  - Advantage: benefits from tuned compilers and math libraries.
  - Keep build scripts and module load recipes under version control.
- Per‑project Python / R environments
  - Create virtualenv or conda environments, or R libraries, that use the system toolchain underneath.
  - Be explicit about how those environments were created (YAML lock files, requirements.txt, etc.).
- Containers (covered elsewhere)
  - Wrap or mirror the cluster’s stack into a Singularity/Apptainer image.
  - Good for shipping the same environment across multiple systems or preserving older stacks past their retirement date.
Whichever path you choose, tie it clearly to the underlying stack to aid reproducibility.
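As a concrete sketch of the first option, a user‑level build typically loads the site toolchain and installs under your own prefix (the package name, install prefix, and configure flags are placeholders):
module purge
module load foss/2024b hdf5/1.14.3-foss-2024b
./configure --prefix=$HOME/apps/mytool-1.2 CC=mpicc FC=mpif90
make -j 8
make install
# Keep this script, plus the module list it assumes, under version control.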
Software Stack Management Tools
Many centers use meta‑tools to help define and maintain software stacks. Understanding their role can help you interpret module naming and versioning.
EasyBuild, Spack, and Similar Tools
Common examples:
- EasyBuild
  Declarative “easyconfigs” describing how software should be built for particular toolchains and architectures.
  Often used to create module hierarchies and named toolchains (foss, intel, etc.).
- Spack
  Package manager specialized for HPC that can generate modules and manage many dependency versions and variants (+mpi, ~cuda, etc.).
As an end‑user:
- You may see module names that reflect these tools’ conventions (e.g. hdf5/1.14.3-gompi-2024b).
- Some centers allow advanced users to run Spack themselves to build personal stacks.
- Even if you don’t use these tools directly, they structure much of the site’s stack.
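Where personal Spack use is allowed, a build on top of site compilers might look roughly like this (the spec is illustrative; available versions and variants differ per site):
spack compiler find                                   # register compilers already in your environment
spack install hdf5@1.14.3 +mpi %gcc@13.2.0 ^openmpi@5.0.4
spack load hdf5@1.14.3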
Site Policies and Documentation
HPC centers usually document:
- Which stacks are supported.
- Which stacks are deprecated.
- Recommended stacks per use case (CPU vs GPU, MPI vs shared‑memory, etc.).
- Example job scripts that assume a particular stack.
Consulting these documents helps you avoid unsupported configurations that might break silently or be hard to debug.
Designing Your Own “Logical” Software Stack
On top of the physical stack provided by the cluster, you often want to define a logical stack tailored to your project.
Characteristics:
- Purpose‑built
  Contains only the software needed for that project (compilers, MPI, libraries, Python/R packages).
- Version‑fixed
  You explicitly pin versions rather than always taking the latest module.
- Reproducibly described
  You maintain a text or code description of the stack (e.g. a shell script, Makefile, or configuration file).
A simple example structure for a project:
- env/modules.sh – loads cluster modules (toolchain, libraries).
- python_env.yml – defines a Conda or virtualenv environment.
- R_packages.txt – records installed R packages.
- build/CMakePresets.json or a Makefile linking against the chosen stack.
- docs/stack.md – a short description: which cluster, which partition, which module set.
This logical stack bridges the gap between the site‑wide stack and the exact environment needed for your specific workflows.
Summary
- An HPC software stack is the layered set of system software, toolchains, libraries, and applications that work together to form your computing environment.
- Stacks are structured around compilers, MPI implementations, and architectures, usually exposed via a hierarchical module system.
- Reproducibility depends strongly on capturing your exact stack: toolchains, library versions, architecture, and how they were combined.
- Use recommended stacks when possible, and record your environment via scripts and logs.
- Avoid mixing incompatible substacks; adhere to module hierarchies and supported combinations.
- When site stacks are insufficient, build on top of them with user‑level installs or containers, documenting the relationship carefully.
- For each project, design a logical stack description that can be recreated, shared, and version‑controlled as part of your reproducible workflow.