What is an HPC Software Stack?
In an HPC context, a software stack is the ordered collection of software layers that turn bare metal (or cloud instances) into a usable scientific computing environment.
You can think of it as several layers, from bottom to top:
- Hardware
  CPUs, GPUs, interconnects, storage.
- System software
  OS, kernel, device drivers, low‑level libraries (e.g. glibc).
- Core HPC infrastructure
  Resource manager / job scheduler, parallel filesystems, monitoring.
- Programming toolchain
  Compilers, MPI libraries, math libraries, build tools, debuggers, profilers.
- Domain and application layer
  Simulation codes, analysis tools, Python/R environments, domain‑specific frameworks.
A cluster’s software stack is how these layers are selected, version‑pinned, built, and made available consistently to all users and nodes.
This chapter focuses on how these stacks are organized and what that means for your day‑to‑day work and reproducibility.
Typical Layers in an HPC Software Stack
While implementations differ between sites, most HPC software stacks share similar structural elements.
System and Core Libraries
These are usually provided and managed by the system administrators:
- Linux distribution packages
  Base system tools, compilers (often older), Python, shells, and system libraries via the distro’s package manager (apt, yum, dnf, zypper, etc.).
- C runtime and basic system libraries
  glibc, libm, low‑level networking and threading libraries (libpthread), etc.
- Kernel‑related interfaces
  NUMA control, hugepages support, GPU drivers, InfiniBand drivers and user‑space libraries.
As a user, you rarely modify this layer, but it constrains which compiler and library versions can be built on top of it.
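If you are curious what this layer looks like on a given node, a few standard commands show the versions you are implicitly building against (a minimal sketch; the GPU query only makes sense on GPU nodes):
ldd --version | head -n 1        # glibc version
uname -r                         # kernel version
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # GPU driver, GPU nodes only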
Programming Toolchains
A toolchain is a coherent set of compiler, MPI, and basic support libraries that are tested to work together.
Typical components:
- Compiler family and version
gcc/13.2.0intel-oneapi-compilers/2025.0nvhpc/24.7clang/18.1- MPI implementation and version
openmpi/5.0.4mpich/4.2.2- Vendor MPI (e.g.
intel-mpi,cray-mpich) - Core numerical and communication libraries
BLAS, LAPACK, FFT, vendor-optimized math libraries (e.g. Intel MKL), and low‑level fabrics (Infiniband verbs, UCX).
Sites often define named toolchains, for example:
- foss/2024b – “Free and Open Source Software” stack: GCC + OpenMPI + OpenBLAS + FFTW, etc.
- intel/2024a – Intel compilers + Intel MPI + MKL.
These logical groupings simplify loading consistent modules and prevent incompatible combinations.
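As a rough illustration, loading one named toolchain pulls in its whole component set at once (a sketch using the foss naming convention; your site’s names may differ):
module purge                 # start from a clean environment
module load foss/2024b       # one logical toolchain...
module list                  # ...which now shows GCC, OpenMPI, OpenBLAS, FFTW, etc.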
Numerical and Scientific Libraries
Built on top of specific toolchains, these are the math and domain libraries you link to or import:
- Linear algebra and solvers
  BLAS/LAPACK/ScaLAPACK, PETSc, Trilinos, MUMPS, SuperLU, hypre.
- FFT and spectral libraries
  FFTW, vendor FFTs (e.g. MKL FFT, cuFFT).
- Mesh, IO, and data frameworks
  HDF5, NetCDF, ADIOS, parallel IO wrappers, mesh and geometry libraries.
Because these libraries often need to be compiled for each compiler/MPI combination, clustering them around toolchains reduces complexity. You might see multiple builds, such as:
- petsc/3.20-gcc-13.2-openmpi-5.0
- petsc/3.20-intel-2024-mpi
The name encodes the underlying stack, which is crucial for reproducibility.
Languages, Runtimes, and Environments
Beyond C/C++/Fortran, stacks provide higher‑level environments:
- Python stacks
  - System Python (from the OS) – often avoided for scientific work.
  - HPC Python environments (e.g. python/3.12-gcc-13.2, anaconda/2024.06) with NumPy, SciPy, mpi4py, Jupyter, etc.
    These are usually compiled against the cluster’s BLAS/MPI libraries.
- R environments
  R/4.4-gcc-13.2 pre‑built with many CRAN/Bioconductor packages using cluster math libraries.
- Java, Julia, and others when needed, possibly integrated with MPI or GPU libraries.
Cluster‑provided environments help avoid conflicts between user‑installed packages and underlying system libraries.
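A common pattern, sketched below with placeholder module names, is to build a small per‑project environment on top of a cluster‑provided Python module so that compiled packages link against the tuned stack underneath:
module load python/3.12-gcc-13.2 openmpi/5.0   # cluster-provided Python + MPI
python -m venv $HOME/envs/myproject            # lightweight per-project environment
source $HOME/envs/myproject/bin/activate
pip install numpy scipy mpi4py                 # mpi4py compiles against the loaded MPI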
Domain-Specific Stacks
On top of all this, many sites organize thematic or domain stacks:
- Computational chemistry / materials
  VASP, Quantum ESPRESSO, CP2K, LAMMPS, GROMACS, NAMD.
- CFD and engineering
  OpenFOAM, SU2, Code_Saturne, commercial CFD solvers.
- Climate, weather, and geoscience
  WRF, CESM, NEMO, ICON, Met Office / ECMWF tools.
These packages depend on the underlying compilers, MPI, and math libraries, and are built to match the system’s architecture and performance priorities.
How Software Stacks Are Organized on Clusters
Although each site has its own policies, some patterns are very common.
Centralized vs. Layered Stacks
Two broad approaches:
- Monolithic / “one big stack”
  A small number of recommended “official” environments (e.g. intel/2024 and foss/2024), with most software built for those only.
  - Easier for admins to maintain and test.
  - Simpler choices for users.
  - Less flexibility if you need something unusual.
- Layered / modular stacks
  Many combinations of compilers, MPI, and libraries exposed via modules.
  - Very flexible.
  - Can be confusing for beginners.
  - Higher risk of users mixing incompatible modules unless the hierarchy enforces constraints.
In both cases, a module hierarchy (covered elsewhere) is typically used to expose only compatible modules after a core module (e.g. compiler) is loaded.
Hierarchical Module Layout
A common scheme is a three‑level hierarchy:
- Core level
  Only architecture‑independent modules: base compilers, Python, basic tools.
- Compiler level
  Libraries and tools built for a specific compiler become visible only after you load that compiler.
  Example:
  - Load: module load gcc/13.2
  - Now you see: hdf5/1.14-gcc, fftw/3.3.10-gcc.
- MPI level
  MPI‑dependent libraries appear only after an MPI module is loaded.
  Example:
  - Load: module load gcc/13.2
  - Load: module load openmpi/5.0
  - Now you see: petsc/3.20-gcc-openmpi, hdf5/1.14-gcc-openmpi-parallel.
This structure encodes the stack’s dependency graph into the way software is discovered and selected.
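A session on such a system might look like the following sketch (module names are illustrative; use module avail or module spider to explore your own cluster):
module avail petsc                 # nothing visible yet: no compiler/MPI loaded
module load gcc/13.2
module load openmpi/5.0
module avail petsc                 # now lists petsc/3.20-gcc-openmpi
module load petsc/3.20-gcc-openmpi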
Architecture-Specific Substacks
On heterogeneous clusters, multiple architecture variants may exist:
- CPU type (e.g. Intel vs AMD, AVX2 vs AVX‑512).
- GPU vs non‑GPU nodes.
- Different network fabrics.
You may encounter modules or prefixes that indicate architecture, for example:
- foss/2024b-skylake vs foss/2024b-zen4
- cuda/12.4 or gpu subtrees that are only useful on GPU nodes.
For reproducibility, recording the exact architecture‑specific stack is as important as the software versions.
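One way to handle this in portable job scripts is to choose the substack based on the node you actually land on; the sketch below assumes hypothetical skylake and zen4 variants of the same toolchain:
cpu_model=$(lscpu | awk -F: '/Model name/ {print $2}')   # identify the CPU on this node
case "$cpu_model" in
  *EPYC*) module load foss/2024b-zen4 ;;                 # AMD partition
  *)      module load foss/2024b-skylake ;;              # default: Intel partition
esac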
Software Stacks and Reproducibility
Software stacks are central to reproducible HPC workflows. The same source code can behave differently or produce slightly different results depending on:
- Compiler and its optimization defaults.
- MPI implementation and configuration.
- Choice and version of math libraries.
  - BLAS/LAPACK implementation (OpenBLAS vs MKL vs vendor BLAS).
- GPU toolchains and driver versions.
Why Pinning the Stack Matters
Re-running an experiment a year later often fails if you only recall:
- “I used module load gcc and ran my_code.”
Instead, reproducibility requires a more precise capture of the entire stack, e.g.:
- Toolchain: foss/2024b
- Extra modules: hdf5/1.14.3-foss-2024b, petsc/3.20.1-foss-2024b, python/3.12-foss-2024b
- GPU stack: cuda/12.4, cudnn/9.2 (if used)
- Architecture or partition: partition=skylake on SLURM.
Even if module names change, this information guides you (or an admin) to reconstruct an equivalent environment.
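In practice this usually means writing the pinned stack directly into the job script itself, roughly as in this sketch (partition and module names are the examples from above, not site defaults):
#!/bin/bash
#SBATCH --partition=skylake
#SBATCH --nodes=2
module purge
module load foss/2024b
module load hdf5/1.14.3-foss-2024b petsc/3.20.1-foss-2024b python/3.12-foss-2024b
# module load cuda/12.4 cudnn/9.2   # only on GPU partitions
srun ./my_code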
Stack Drift and Software Lifecycles
Over time, clusters:
- Introduce new toolchains (e.g. 2024a, 2024b).
- Deprecate older ones.
- Apply security and bug‑fix updates that may slightly change behavior.
HPC centers often:
- Maintain multiple stack generations in parallel (e.g. “2023 stack” and “2024 stack”).
- Announce end‑of‑life dates for old stacks.
- Encourage users to migrate codes and workflows to newer stacks.
For long‑running projects, it can be wise to:
- Standardize on a specific stack version (e.g. foss/2023b).
- Freeze your workflows and record exact module sets.
- Plan for periodic re‑validation when moving to a new stack generation.
Strategies for Working with Software Stacks
This section focuses on practical ways to interact with the stack so your work remains manageable and reproducible.
Use Recommended / Default Stacks When Possible
Most centers publish “recommended environments”:
- e.g. “Use foss/2024b for general CPU work”
- e.g. “Use nvhpc/24.7 + cuda/12.4 for GPU codes on partition gpu”
Benefits:
- Better support from admins and documentation.
- More testing and fewer surprises.
- Easier to find examples and teaching materials.
Unless you have strong reasons to deviate, start with these defaults.
Record Your Stack Automatically
In your job scripts and analysis notebooks, record the active environment:
- Save the module list:
  module list 2>&1 | tee modules_used.txt
  or
  module -t list > modules_used.txt
- Store environment variables:
  env | sort > environment.txt
- Log compiler/MPI details from your code (e.g. by calling MPI_Get_library_version or printing __VERSION__ macros in C/C++).
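A compact way to do all of this is a short logging block near the top of each job script, along the lines of this sketch (the output file names are arbitrary choices):
module -t list > modules_used.txt 2>&1      # terse list of loaded modules
env | sort > environment.txt                # full environment
mpicc --version  > compiler_version.txt 2>&1 || true   # compiler behind the MPI wrapper, if present
mpirun --version > mpi_version.txt 2>&1 || true         # MPI runtime version, if present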
These logs become part of your reproducibility record alongside input data and code.
Encapsulate Stacks in Setup Scripts
Rather than relying on interactive module load sequences you might forget, create a small setup script:
# myproject_env.sh
module purge
module load foss/2024b
module load hdf5/1.14.3-foss-2024b
module load petsc/3.20.1-foss-2024b
module load python/3.12-foss-2024b

Then:
- In your shell: source myproject_env.sh
- In job scripts: source /path/to/myproject_env.sh
This ensures you, your collaborators, and your future self always use the same stack.
Avoid Mixing Incompatible Substacks
Some common pitfalls:
- Loading libraries from different toolchains simultaneously (e.g. gcc‑built HDF5 with intel‑built NetCDF).
- Mixing MPI‑enabled and serial versions of the same library.
- Combining CPU‑only modules with GPU‑only builds incorrectly.
Typical symptoms:
- Link‑time errors (undefined references, symbol clashes).
- Run‑time crashes or subtle numerical differences.
Mitigation:
- Respect the module hierarchy: load a compiler, then MPI, then only modules that appear under that hierarchy.
- Prefer single, coherent toolchains (e.g. foss or intel) instead of cherry‑picking pieces from different ones.
- If unsure, ask for the supported combination for your application.
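If you suspect a mixed stack, inspecting which shared libraries an executable actually resolves can narrow the problem down; in this sketch the install prefixes are purely illustrative:
ldd ./my_code | grep -Ei 'hdf5|netcdf|mpi|blas'
# Every match should live under a single toolchain tree (e.g. .../foss/2024b/...),
# not a mixture of, say, foss- and intel-based install prefixes.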
When the Stack Does Not Have What You Need
Even with large stacks, you may need:
- A newer version of a library.
- A niche tool.
- A specific build configuration.
Options (from least to most isolated):
- User‑level builds on top of the site stack (see the sketch below)
  - Compile in your home or project directory, using compilers and MPI from the system modules.
  - Advantage: benefits from tuned compilers and math libraries.
  - Keep build scripts and module load recipes under version control.
- Per‑project Python / R environments
  - Create virtualenv or conda environments, or R libraries, that use the system toolchain underneath.
  - Be explicit about how those environments were created (YAML lock files, requirements.txt, etc.).
- Containers (covered elsewhere)
  - Wrap or mirror the cluster’s stack into a Singularity/Apptainer image.
  - Good for shipping the same environment across multiple systems or preserving older stacks past their retirement date.
Whichever path you choose, tie it clearly to the underlying stack to aid reproducibility.
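As a concrete sketch of the first option, a user‑level build typically loads the site toolchain and installs under your own prefix (the package name, install prefix, and configure flags are placeholders):
module purge
module load foss/2024b hdf5/1.14.3-foss-2024b
./configure --prefix=$HOME/apps/mytool-1.2 CC=mpicc FC=mpif90
make -j 8
make install
# Keep this script, plus the module list it assumes, under version control.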
Software Stack Management Tools
Many centers use meta‑tools to help define and maintain software stacks. Understanding their role can help you interpret module naming and versioning.
EasyBuild, Spack, and Similar Tools
Common examples:
- EasyBuild
  Declarative “easyconfigs” describing how software should be built for particular toolchains and architectures.
  Often used to create module hierarchies and named toolchains (foss, intel, etc.).
- Spack
  Package manager specialized for HPC that can generate modules and manage many dependency versions and variants (+mpi, ~cuda, etc.).
As an end‑user:
- You may see module names that reflect these tools’ conventions (e.g. hdf5/1.14.3-gompi-2024b).
- Some centers allow advanced users to run Spack themselves to build personal stacks.
- Even if you don’t use these tools directly, they structure much of the site’s stack.
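Where personal Spack use is allowed, a build on top of site compilers might look roughly like this (the spec is illustrative; available versions and variants differ per site):
spack compiler find                                   # register compilers already in your environment
spack install hdf5@1.14.3 +mpi %gcc@13.2.0 ^openmpi@5.0.4
spack load hdf5@1.14.3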
Site Policies and Documentation
HPC centers usually document:
- Which stacks are supported.
- Which stacks are deprecated.
- Recommended stacks per use case (CPU vs GPU, MPI vs shared‑memory, etc.).
- Example job scripts that assume a particular stack.
Consulting these documents helps you avoid unsupported configurations that might break silently or be hard to debug.
Designing Your Own “Logical” Software Stack
On top of the physical stack provided by the cluster, you often want to define a logical stack tailored to your project.
Characteristics:
- Purpose‑built
  Contains only the software needed for that project (compilers, MPI, libraries, Python/R packages).
- Version‑fixed
  You explicitly pin versions rather than always taking the latest module.
- Reproducibly described
  You maintain a text or code description of the stack (e.g. a shell script, Makefile, or configuration file).
A simple example structure for a project:
- env/modules.sh – loads cluster modules (toolchain, libraries).
- python_env.yml – defines a Conda or virtualenv environment.
- R_packages.txt – records installed R packages.
- build/CMakePresets.json or a Makefile linking against the chosen stack.
- docs/stack.md – a short description: which cluster, which partition, which module set.
This logical stack bridges the gap between the site‑wide stack and the exact environment needed for your specific workflows.
Summary
- An HPC software stack is the layered set of system software, toolchains, libraries, and applications that work together to form your computing environment.
- Stacks are structured around compilers, MPI implementations, and architectures, usually exposed via a hierarchical module system.
- Reproducibility depends strongly on capturing your exact stack: toolchains, library versions, architecture, and how they were combined.
- Use recommended stacks when possible, and record your environment via scripts and logs.
- Avoid mixing incompatible substacks; adhere to module hierarchies and supported combinations.
- When site stacks are insufficient, build on top of them with user‑level installs or containers, documenting the relationship carefully.
- For each project, design a logical stack description that can be recreated, shared, and version‑controlled as part of your reproducible workflow.