
15.1 Software stacks

What is an HPC Software Stack?

In an HPC context, a software stack is the ordered collection of software layers that turn bare metal (or cloud instances) into a usable scientific computing environment.

You can think of it as several layers, from bottom to top:

  1. Hardware
    CPUs, GPUs, interconnects, storage.
  2. System software
    OS, kernel, device drivers, low‑level libraries (e.g. glibc).
  3. Core HPC infrastructure
    Resource manager / job scheduler, parallel filesystems, monitoring.
  4. Programming toolchain
    Compilers, MPI libraries, math libraries, build tools, debuggers, profilers.
  5. Domain and application layer
    Simulation codes, analysis tools, Python/R environments, domain‑specific frameworks.

A cluster’s software stack is defined by how these layers are selected, version-pinned, built, and made available consistently to all users and nodes.

This chapter focuses on how these stacks are organized and what that means for your day‑to‑day work and reproducibility.

Typical Layers in an HPC Software Stack

While implementations differ between sites, most HPC software stacks share similar structural elements.

System and Core Libraries

These are usually provided and managed by the system administrators:

  • Operating system and kernel
  • Device drivers (network, GPU)
  • Core system libraries (e.g. glibc)
  • Low-level infrastructure such as the batch system and filesystem clients

As a user, you rarely modify this layer, but it constrains which compiler and library versions can be built on top of it.

Programming Toolchains

A toolchain is a coherent set of compiler, MPI, and basic support libraries that are tested to work together.

Typical components:

  • A compiler suite (e.g. GCC, Intel, LLVM)
  • An MPI library (e.g. Open MPI, Intel MPI, MPICH)
  • Core math and support libraries (BLAS/LAPACK, FFTW)

Sites often define named toolchains, for example:

  • foss: GCC, Open MPI, and open-source math libraries (OpenBLAS/FlexiBLAS, FFTW, ScaLAPACK)
  • intel: Intel compilers, Intel MPI, and MKL

These logical groupings make it easy to load a consistent set of modules and prevent incompatible combinations.
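
As a quick illustration, on a site that uses Lmod and EasyBuild-style toolchains (the specific module names below are assumptions; yours will differ), loading a named toolchain pulls in the whole coherent set at once:

  module spider foss       # discover available toolchain versions (Lmod)
  module load foss/2024b   # compiler + MPI + math libraries in one step
  module list              # inspect everything the toolchain loaded
  mpicc --version          # the MPI wrapper now reports the toolchain's GCC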

Numerical and Scientific Libraries

Built on top of specific toolchains, these are the math and domain libraries you link to or import:

  • Dense linear algebra: BLAS/LAPACK implementations and ScaLAPACK
  • Fast Fourier transforms: FFTW
  • Parallel I/O: HDF5, NetCDF
  • Solver frameworks such as PETSc

Because these libraries often need to be compiled for each compiler/MPI combination, clustering them around toolchains reduces complexity. You might see multiple builds of the same library, such as:

  hdf5/1.14.3-foss-2024b
  hdf5/1.14.3-intel-2024a

The name encodes the underlying stack, which is crucial for reproducibility.

Languages, Runtimes, and Environments

Beyond C/C++/Fortran, stacks provide higher‑level environments:

  • Python, often with NumPy/SciPy and MPI bindings built against the system stack
  • R with commonly used package sets
  • Julia and other language runtimes

Cluster‑provided environments help avoid conflicts between user‑installed packages and the underlying system libraries.
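
For example, a per-project Python environment can be layered on the cluster's Python module rather than a self-contained installation, so the heavy numerical libraries still come from the tuned stack (module names here are illustrative and mirror the setup script shown later in this chapter):

  module load foss/2024b python/3.12-foss-2024b
  python -m venv "$HOME/envs/myproject"        # isolated per-project packages
  source "$HOME/envs/myproject/bin/activate"
  pip install -r requirements.txt              # pinned dependencies, kept in git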

Domain-Specific Stacks

On top of all this, many sites organize thematic or domain stacks, for example:

  • Bioinformatics pipelines and tools
  • Climate and weather models
  • Computational chemistry and materials science codes
  • Machine learning frameworks

These packages depend on the underlying compilers, MPI, and math libraries, and are built to match the system's architecture and performance priorities.

How Software Stacks Are Organized on Clusters

Although each site has its own policies, some patterns are very common.

Centralized vs. Layered Stacks

Two broad approaches:

  1. Monolithic / “one big stack”
    A small number of recommended “official” environments (e.g. intel/2024 and foss/2024) with most software built for those only.
    • Easier for admins to maintain and test.
    • Simpler choices for users.
    • Less flexibility if you need something unusual.
  2. Layered / modular stacks
    Many combinations of compilers, MPI, and libraries exposed via modules.
    • Very flexible.
    • Can be confusing for beginners.
    • Higher risk of users mixing incompatible modules unless the hierarchy enforces constraints.

In both cases, a module hierarchy (covered elsewhere) is typically used to expose only compatible modules after a core module (e.g. compiler) is loaded.

Hierarchical Module Layout

A common scheme is a three‑level hierarchy:

  1. Core level
    Compilers and compiler-independent tools; always visible.
  2. Compiler level
    Libraries built with a specific compiler; visible only after that compiler module is loaded.
  3. MPI level
    Libraries built for a specific compiler + MPI combination; visible only after both are loaded.

This structure encodes the stack’s dependency graph into the way software is discovered and selected.
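
On an Lmod-based system, this typically plays out as follows (the module names and versions are illustrative):

  module avail openmpi   # not visible yet: it depends on a compiler
  module load gcc/13.2   # load a core-level compiler
  module avail openmpi   # builds for gcc/13.2 now appear
  module load openmpi/4.1.6
  module avail fftw      # MPI-dependent libraries appear at the third level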

Architecture-Specific Substacks

On heterogeneous clusters, multiple architecture variants of the stack may exist:

  • Builds for different CPU generations (e.g. AVX2 vs. AVX‑512 capable nodes)
  • GPU-enabled vs. CPU-only builds

You may encounter modules or prefixes that indicate the architecture; naming is site-specific, but architecture suffixes or separate per-architecture module trees are common.

For reproducibility, recording the exact architecture‑specific stack is as important as recording the software versions.
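
A minimal way to capture this alongside your module list, using standard Linux tools:

  uname -m > arch_used.txt                        # e.g. x86_64 or aarch64
  lscpu | grep -i 'model name' >> arch_used.txt   # exact CPU model of the node
  hostname >> arch_used.txt                       # which node this ran on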

Software Stacks and Reproducibility

Software stacks are central to reproducible HPC workflows. The same source code can behave differently or produce slightly different results depending on:

  • The compiler version and optimization flags
  • The MPI implementation and version
  • The math library backend (different BLAS implementations can round differently)
  • The underlying OS and system libraries

Why Pinning the Stack Matters

Re-running an experiment a year later often fails if all you recall is something like:

  “I used GCC and some version of Open MPI, plus whatever HDF5 was the default.”

Instead, reproducibility requires a more precise capture of the entire stack, e.g.:

  • The exact modules and versions loaded, including toolchain suffixes
  • The toolchain itself (e.g. foss/2024b)
  • The architecture variant and node type the job ran on
  • The date and, ideally, the cluster's OS release at the time

Even if module names change, this information guides you (or an admin) to reconstruct an equivalent environment.

Stack Drift and Software Lifecycles

Over time, clusters:

  • Upgrade the operating system, drivers, and firmware
  • Add new toolchain versions and change the defaults
  • Deprecate and eventually remove old modules

HPC centers often:

  • Announce deprecation schedules well in advance
  • Keep one or two older stacks available during a transition period

For long‑running projects, it can be wise to:

  • Record the full stack (module lists, environment, architecture) from day one
  • Test your workflow against new default stacks as they appear
  • Containerize the environment before its stack is retired

Strategies for Working with Software Stacks

This section focuses on practical ways to interact with the stack so your work remains manageable and reproducible.

Use Recommended / Default Stacks When Possible

Most centers publish “recommended environments”, typically one or two current default toolchains plus the library builds that go with them.

Benefits:

  • They are the best-tested combinations on that system
  • Most of the site's software is built for them
  • Support staff can reproduce and debug problems more easily

Unless you have strong reasons to deviate, start with these defaults.

Record Your Stack Automatically

In your job scripts and analysis notebooks, record the active environment:

  module list 2>&1 | tee modules_used.txt

or

  module -t list 2> modules_used.txt
  env | sort > environment.txt

These logs become part of your reproducibility record alongside input data and code.
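
In a Slurm batch script, for example, the capture can happen right where the modules are loaded (a sketch; the directives, module names, and the mysolver executable are assumptions):

#!/bin/bash
#SBATCH --job-name=myrun
module purge
module load foss/2024b
module -t list 2> "modules_${SLURM_JOB_ID}.txt"   # the stack this job actually used
env | sort > "env_${SLURM_JOB_ID}.txt"
srun ./mysolver input.nml                         # placeholder application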

Encapsulate Stacks in Setup Scripts

Rather than relying on interactive module load sequences you might forget, create a small setup script:

# myproject_env.sh
module purge
module load foss/2024b
module load hdf5/1.14.3-foss-2024b
module load petsc/3.20.1-foss-2024b
module load python/3.12-foss-2024b

Then, in interactive sessions and job scripts alike:

  source myproject_env.sh

This ensures you, your collaborators, and your future self always use the same stack.

Avoid Mixing Incompatible Substacks

Some common pitfalls:

  • Loading libraries built for a different toolchain (e.g. an -intel- build on top of a foss toolchain)
  • Mixing MPI implementations, or linking an application against one MPI and running it with another
  • Combining conda- or pip-provided binary libraries with module-provided ones in the same process

Typical symptoms:

  • Link errors or “undefined symbol” messages
  • Crashes at startup or inside MPI_Init
  • Subtly wrong or irreproducible numerical results

Mitigation:

  • Start from module purge and load a single toolchain
  • Only combine modules whose names carry the same toolchain suffix
  • Encapsulate the combination in a setup script, as above
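
One quick sanity check is to inspect which libraries a binary actually resolves to (mysolver is a placeholder for your executable; module names are illustrative):

  module purge
  module load foss/2024b hdf5/1.14.3-foss-2024b
  ldd ./mysolver | grep -Ei 'mpi|hdf5'   # paths should point into the loaded stack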

When the Stack Does Not Have What You Need

Even with large stacks, you may need:

  • A newer version of a package than the site provides
  • A niche tool that is not installed at all
  • A custom patch or non-standard build option

Options (from least to most isolated):

  1. User‑level builds on top of the site stack
    • Compile in your home or project directory, using compilers and MPI from the system modules.
    • Advantage: benefits from tuned compilers and math libraries.
    • Keep build scripts and module load recipes under version control.
  2. Per‑project Python / R environments
    • Create virtualenv, conda envs, or R libraries that use the system toolchain underneath.
    • Be explicit about how those environments were created (YAML lock files, requirements.txt, etc.).
  3. Containers (covered elsewhere)
    • Wrap or mirror the cluster’s stack into a Singularity/Apptainer image.
    • Good for shipping the same environment across multiple systems or preserving older stacks past their retirement date.

Whichever path you choose, tie it clearly to the underlying stack to aid reproducibility.
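
As a sketch of option 1, assuming an autotools-based package and the toolchain modules used earlier in this chapter (paths and names are illustrative):

  module purge
  module load foss/2024b hdf5/1.14.3-foss-2024b
  ./configure --prefix="$HOME/sw/mytool" CC=mpicc   # build against the loaded toolchain
  make -j 4 && make install
  # keep this recipe in version control next to your recorded module list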

Software Stack Management Tools

Many centers use meta‑tools to help define and maintain software stacks. Understanding their role can help you interpret module naming and versioning.

EasyBuild, Spack, and Similar Tools

Common examples:

  • EasyBuild: builds software from community “easyconfig” recipes; module names like hdf5/1.14.3-foss-2024b follow its toolchain naming convention.
  • Spack: builds software from package recipes with flexible compiler, version, and variant selection.

As an end‑user:

  • You usually do not run these tools yourself, but their conventions explain the module names you see.
  • Some sites let you use them to install extra software into your own directories, reusing the site's compilers and toolchains.

Site Policies and Documentation

HPC centers usually document:

  • Which toolchains and stacks are recommended and supported
  • Module naming conventions and the module hierarchy
  • Deprecation and retirement schedules
  • How to request new software or versions

Consulting these documents helps you avoid unsupported configurations that might break silently or be hard to debug.

Designing Your Own “Logical” Software Stack

On top of the physical stack provided by the cluster, you often want to define a logical stack tailored to your project.

Characteristics:

  • A short, explicit list of pinned modules your project depends on
  • A setup script that loads exactly those modules
  • Captured logs (module list, environment) stored with code and data

A simple example structure for a project:

  myproject/
    env/
      myproject_env.sh    # module loads, as above
      modules_used.txt    # captured with module -t list
      environment.txt     # captured with env | sort
    src/
    data/
    results/

This logical stack bridges the gap between the site‑wide stack and the exact environment needed for your specific workflows.

Summary

  • An HPC software stack is the layered collection of system software, toolchains, libraries, and applications that makes a cluster usable.
  • Toolchains (compiler + MPI + math libraries) are the central organizing unit; library builds encode their toolchain in the module name.
  • Reproducibility depends on pinning and recording the exact stack, including architecture-specific variants.
  • Prefer the site's recommended stacks, encapsulate environments in setup scripts, and avoid mixing incompatible substacks.
  • Tools like EasyBuild and Spack explain how stacks are built and named; on top of them, define a logical per-project stack of your own.
