15 Reproducibility and Software Environments

Why Reproducibility Matters in HPC

Reproducibility in high performance computing means that someone else, or you in the future, can run the same workflow and obtain results that are consistent with the original ones, given the same inputs and conditions. In HPC this is especially important because simulations often inform scientific conclusions, engineering decisions, or business strategy, and they can be expensive to run again.

On a personal level, reproducibility protects you from your own forgetfulness. If you cannot reconstruct which code version, which compiler, and which libraries you used, it may be impossible to understand differences between two runs. On a broader level, reproducibility is central to scientific credibility. Many HPC results are impossible to verify by hand, so a clear, repeatable computational story is the main way to build trust.

HPC environments add some unique challenges. Cluster software stacks evolve, hardware is heterogeneous, and jobs are dispatched by schedulers. You often do not control system updates, and different users may see different module combinations. Reproducibility in this context is therefore not automatic. It requires deliberate use of software environment tools, documentation, and sometimes encapsulation technologies such as containers.

Reproducible HPC work always specifies:

  1. Code version.
  2. Software environment and dependencies.
  3. Input data and parameters.
  4. Execution environment and resources.
  5. Exact steps to run and postprocess.

The rest of this chapter focuses on the software environment part of this list, and how to control and record it so that runs can be faithfully reproduced.

The Idea of a Software Environment

A software environment is the collection of compilers, interpreters, libraries, tools, configuration variables, and sometimes hardware specific settings that together determine how your code is compiled and executed. On an HPC system this might include which MPI implementation you use, which BLAS library is linked, which Python version runs your scripts, and which CUDA toolkit is present.

Two runs of the same source code are not meaningfully comparable if they use very different software environments. Even small differences in compiler versions, optimization flags, or math libraries can change numerical behavior, performance, and sometimes the qualitative outcome. For instance, using a different BLAS library can change floating point summation order, which affects roundoff error. Depending on the sensitivity of your simulation, this might or might not matter.

In shared HPC systems the default environment is often minimal, and users construct working environments using mechanisms such as environment modules, virtual environments, and containers. From a reproducibility perspective, your goal is not just to get a working environment once, but to define it in a way that can be recreated later and by other people.

Sources of Irreproducibility in Software Environments

Several common patterns in HPC make otherwise sensible workflows hard to reproduce.

First, the implicit use of system defaults creates ambiguity. If you compile with mpicc without knowing which MPI implementation and version it refers to, or you run python without specifying whether it is the system Python or a module, then reproducing the run later, or on another system, is difficult. The default can change when the system is upgraded.
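
A quick way to resolve this ambiguity is to check what the bare commands actually point to, and to record their versions, before you compile or run. The sketch below assumes a standard Linux shell; the exact output format differs between MPI implementations and Python builds.

# Check which binaries the bare commands resolve to, and record their versions.
which mpicc          # the path usually reveals which MPI installation is active
mpicc --version      # the wrapper reports the underlying compiler version
mpirun --version     # the MPI implementation and its version

which python3        # system Python, module Python, or a virtual environment?
python3 --version

Saving this output next to your build log already removes much of the guesswork when a run needs to be reproduced later.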

Second, interactive experimentation can leave you with an accidental environment. You might load and unload modules, install Python packages in your home directory, adjust PATH and LD_LIBRARY_PATH, and eventually get something that works. If you never record the sequence of steps or freeze the resulting configuration, that environment is effectively unique and fragile. Logging commands helps, but a scripted setup is more robust.

Third, using unpinned versions of external dependencies can cause silent drift. If your workflow says “install NumPy” without stating a version, package managers may install whatever is current at the time. Over months or years, you can end up running subtly different code. In compiled environments, linking to a generic MKL or OpenBLAS module without recording its version leads to similar drift.

Finally, hardware sensitive behavior, in particular floating point nondeterminism across thread or process counts, interacts with software choices. Some math libraries and compilers provide flags and settings to trade reproducibility for performance. If you accept defaults without documenting them, results may differ when a system upgrade changes those defaults.

Capturing and Describing Software Environments

To make an HPC workflow reproducible you must be able to describe and, ideally, regenerate its software environment. This description should be precise but not fragile. It should include specific versions where they matter, and should be captured as text so that it can be stored in version control and shared.

On module based systems, a simple starting point is to record the output of module list when you compile and run. This usually shows which compiler, MPI, math libraries, and tools are active. Combining this with env or printenv gives a snapshot of relevant environment variables. While this snapshot is not directly reusable as a script, it is a useful record.
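
As a minimal sketch, the following commands save such a snapshot next to a run; the run_logs directory name is only an example, and some module tools print their listing to standard error, which is why it is redirected as well.

# Save a snapshot of the active modules and environment variables.
stamp=$(date +%Y%m%d_%H%M%S)
mkdir -p run_logs
module list > "run_logs/modules_${stamp}.txt" 2>&1   # some module tools print to stderr
env | sort > "run_logs/env_${stamp}.txt"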

For custom Python environments, tools such as pip freeze or conda env export generate machine readable descriptions of installed packages. These can serve as reproducible specifications of the user level parts of your environment. Similarly, for system level package managers you can record package lists. On some HPC systems, you may not have direct access to such managers, so modules are your main mechanism.
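
For example, assuming a Python virtual environment or conda environment is currently active, the following commands write such descriptions to text files that can be committed alongside the code.

# Record the exact Python packages installed in the active environment.
pip freeze > requirements_frozen.txt

# The equivalent for a conda environment, including versions of all packages.
conda env export > environment.yml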

The most useful form of description is one that also acts as a recipe. For example, a shell script that loads specific modules and sets key variables, or a conda environment file that reproduces the same Python stack. In container based workflows, the Dockerfile or definition file plays this role. You should store these recipes in the same repository as your source code so that environment and code evolve together.

A reproducible software environment is best described by:

  • Explicit version numbers, not “latest.”
  • A script or configuration file, not only prose.
  • Output of diagnostic commands such as module list, env, or pip freeze saved with the run.

Using Environment Modules for Reproducibility

Most HPC clusters use an environment modules system, often with commands like module load and module list. A dedicated chapter focuses on basic usage. Here the concern is how to use modules in a reproducible way.

Modules provide named collections of settings that adjust your environment. They control paths to compilers, libraries, and tools. Without discipline, a long interactive session with many module load and module unload operations can create a one off environment. Instead, you should think in terms of explicit starting points and reproducible sequences.

One simple pattern is to begin every new shell with a clean module state, for example by running module purge. From this well defined baseline, you load only the modules that your workflow requires. You can then place the necessary module load commands in a small script, such as setup_environment.sh, and source it before compiling or running.

Because modules are versioned, avoid generic names when possible. Prefer module load gcc/12.2.0 to module load gcc. This prevents unexpected changes when the system default version is updated. If your cluster uses module collections or saved environments, you can also record a particular set of modules and reload it later, although scripts are more portable.
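
A minimal setup_environment.sh following this pattern might look like the sketch below; the module names and versions are placeholders for whatever your cluster actually provides.

#!/bin/bash
# setup_environment.sh - reproducible environment for this project.
# Module names and versions are examples; adjust them to your cluster.

module purge                  # start from a clean, well defined baseline
module load gcc/12.2.0        # pinned compiler
module load openmpi/4.1.5     # pinned MPI implementation
module load openblas/0.3.23   # pinned math library

export OMP_NUM_THREADS=8      # make threading choices explicit and recorded

Sourcing this script both before compiling and at the top of each job script ensures that the same environment is used in both stages.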

Modules interact with job schedulers. In batch scripts for SLURM or other systems, you should load the same modules you used during compilation, at the top of the job script. Relying on your login shell environment to carry over into batch execution is unsafe, because job environments are usually fresh shells. If your job script is the authoritative document for which modules are needed, then reproducing a run becomes as simple as reusing that script.

Recording the full module state for important runs helps with long term reproducibility. Some users add lines to their job scripts that run module list and redirect the output to a log file in the job’s output directory. This provides a concrete record even if the module tree later changes.
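
Putting these ideas together, a SLURM job script might look like the following sketch; the resource requests, executable name, and input file are placeholders, and the setup script is the one described above.

#!/bin/bash
#SBATCH --job-name=mysim
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --time=04:00:00

# Load exactly the modules that were used at compile time.
source ./setup_environment.sh

# Record the module state and basic job information next to the results.
outdir="run_${SLURM_JOB_ID}"
mkdir -p "$outdir"
module list > "$outdir/modules.txt" 2>&1
echo "nodes: ${SLURM_JOB_NODELIST}" > "$outdir/job_info.txt"

# Launch the application (mysim and input.nml are placeholders).
srun ./mysim input.nml > "$outdir/stdout.log"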

Versioning and Locking Dependencies

Reproducibility improves when you treat dependencies with the same care as your own source code. This often means explicit versioning and some form of locking. In HPC, there are several layers to consider.

At the library level, link against specific versions of key libraries where possible. For example, if your cluster provides several BLAS or FFT libraries through modules, record which one you use and its version. At the language runtime level, specify which Python, R, or Julia release is required. For Python, you can maintain a requirements.txt file with pinned versions like numpy==1.26.3 instead of numpy>=1.20.
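
As a small illustration, a pinned requirements file can be written and applied as follows; the package versions are only examples and should reflect what you have actually tested.

# Write a requirements file with pinned versions (example versions only).
printf '%s\n' 'numpy==1.26.3' 'scipy==1.11.4' 'h5py==3.10.0' > requirements.txt

# Install exactly these versions into the active environment.
pip install -r requirements.txt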

At the toolchain level, compiler and MPI versions matter not only for performance but also for behavior and binary compatibility. It is common to treat a particular combination, for instance gcc/12.2.0, openmpi/4.1.5, and cuda/12.1, as part of your project’s specification. If you upgrade any component, you should rebuild and retest your code, and consider this a new configuration.

Locking dependencies means that you have some file that, when applied to an appropriate base system, reconstructs the same environment. In Python ecosystems, this might be a requirements.txt or a more detailed lock file. In conda based workflows, an environment.yml file with exact build identifiers can serve this role. For containers, the definition file itself is the lock for the environment inside the image.
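
For conda based workflows, the export and recreate cycle looks roughly like this; the environment name and file names are placeholders, and the --no-builds option can be used when exact build identifiers are too platform specific.

# Export the active environment, including exact package builds.
conda env export > environment.lock.yml

# Later, or on another machine, recreate the same environment from the file.
conda env create -n myproject -f environment.lock.yml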

Locking everything perfectly can be difficult on shared systems because you cannot freeze system libraries or the module tree. The goal is not to control the entire cluster, but to be precise about the parts you rely on and to detect when they change. For critical workflows, some teams maintain their own set of modules or toolchains, often installed in project specific directories, to minimize surprises.

Containers and Portable Environments

Containers provide a way to bundle an entire user space environment, including compilers, libraries, runtimes, and sometimes small datasets, into a single image file. On HPC systems, technologies like Singularity or Apptainer are used to run such images within the constraints of batch schedulers and parallel file systems. A separate chapter introduces these tools. Here the focus is their role in reproducibility.

A container image can capture almost everything above the kernel, so that your code sees the same environment regardless of the underlying cluster. This includes the versions of language runtimes, math libraries, and even tools like cmake. If you build and tag an image as mycode:1.0, then you, or someone else, can run the same image in the future, and expect the software environment inside it to be unchanged.

For long term reproducibility, the important object is not just the image itself, but also the recipe you used to build it, such as a Dockerfile or Singularity definition file. That recipe should explicitly specify versions of base images and packages. For example, you might start from ubuntu:22.04 and install gcc-12 and particular library releases. As with other environments, you want this to be text stored in version control.
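
A minimal Apptainer definition file along these lines might look as follows; the package selection is purely illustrative, and a Dockerfile would express the same recipe in its own syntax.

Bootstrap: docker
From: ubuntu:22.04

%post
    # Install a pinned toolchain inside the image (illustrative selection).
    apt-get update
    apt-get install -y --no-install-recommends \
        gcc-12 g++-12 gfortran-12 make cmake python3
    rm -rf /var/lib/apt/lists/*

%labels
    Version 1.0

%help
    Build environment for mycode 1.0.
    Build with: apptainer build mycode_1.0.sif mycode.def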

Containers do not replace the need to work with the cluster’s environment, because you still interact with the scheduler, interconnect, and file systems. HPC container runtimes often integrate with host MPI stacks or GPU drivers. For reproducibility, you need to pay attention to how the container is launched. The exact srun or mpirun command, along with container runtime flags, should be recorded with the job.
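
For example, a launch line in a job script might look like the following; the image name, bind paths, and the use of the host MPI through srun are assumptions that depend on how your cluster integrates containers.

# Launch the containerized application; keep this exact command with the run.
srun apptainer exec --bind /scratch:/scratch mycode_1.0.sif \
    /opt/mycode/bin/mysim input.nml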

There is a trade off between portability and performance. Highly tuned vendor libraries and MPI implementations provided by the cluster may not be fully encapsulated in generic containers. Many HPC workflows therefore use hybrid approaches where the container provides the application and high level dependencies, while critical low level components, such as MPI or GPU drivers, are used from the host. In such cases, you must still treat those host components as part of your reproducibility story.

For container based reproducibility, always keep:

  • The container image, tagged with a clear version.
  • The container definition file in version control.
  • The exact launch command used on the HPC system.

Recording Computational Experiments

Reproducibility extends beyond having the right software environment at a single point in time. You also need a record of which environment was used in each computational experiment. In HPC this can be organized around job scripts, log files, and version controlled configuration.

A simple and effective pattern is to view the batch job script as the canonical specification of how a run is performed. The script can load the correct modules, activate a virtual environment or container, set environment variables, and finally launch the application with specified inputs and resources. By associating each significant run with a copy of the job script, possibly parameterized, you document both environment and execution.

Inside your application, it is useful to record metadata about each run. Many simulation codes write headers in their output files that include a timestamp, the code version, and sometimes compiler and library information. You can extend this idea by capturing selected environment variables or output of diagnostic commands and writing them into the output directory. For example, your code could run git rev-parse HEAD to record the current commit hash, or read OMP_NUM_THREADS and include it in the run header.

Even without modifying the code, you can automate metadata capture in wrapper scripts. For instance, before launching the main executable, a shell script can write the code hash, module list output, and job scheduler variables such as job ID and node list to a small text file in the output directory. These files can then be associated with results in analysis or publication.
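
A small wrapper along these lines, called from the job script before the main launch command, is one way to do this; the file names are placeholders, and it assumes that the working directory is a git checkout and that the module command is available in batch shells.

#!/bin/bash
# capture_metadata.sh - record provenance information for a run.
# Usage: ./capture_metadata.sh <output_directory>

outdir=${1:-run_metadata}
mkdir -p "$outdir"

# Code version (assumes this is run inside a git working copy).
git rev-parse HEAD > "$outdir/code_commit.txt" 2>/dev/null

# Active modules and a selection of relevant environment variables.
module list > "$outdir/modules.txt" 2>&1
env | grep -E '^(OMP_|SLURM_|PATH=|LD_LIBRARY_PATH=)' > "$outdir/environment.txt"

# Basic scheduler information, if running inside a SLURM job.
{
  echo "job_id=${SLURM_JOB_ID:-none}"
  echo "nodes=${SLURM_JOB_NODELIST:-none}"
  date --iso-8601=seconds
} > "$outdir/job.txt"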

For parametric studies and large sweeps, managing many runs can become complex. Workflow managers and scripting frameworks, which are covered elsewhere, often include facilities for recording run configurations. Regardless of tools, the guiding principle is that no important run should depend only on your memory for its settings.

Reproducible Workflows Across Systems

HPC work often moves between environments. You might develop and test on a laptop or small server, then scale up on a large cluster. Collaborators might use different institutions’ systems. Reproducibility in this context means not only repeating runs on one machine, but also reconstructing them elsewhere in a comparable way.

To achieve this, you need abstraction layers in your workflow. For example, instead of hardcoding absolute paths and cluster specific module names in your code, you can separate configuration into files or scripts that adapt to each system. The structure of your environment recipe and job scripts can be similar across systems, even if the specific modules or container images differ.
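
One simple way to implement this is a small dispatch script that selects a system specific setup file based on the host name; the cluster names and file paths below are placeholders.

#!/bin/bash
# load_environment.sh - choose the environment recipe for the current system.
# Cluster names and setup files are placeholders for your own systems.

case "$(hostname -f)" in
  *.clusterA.example.org)
    source env/clusterA_modules.sh     # module based setup on cluster A
    ;;
  *.clusterB.example.org)
    source env/clusterB_container.sh   # container based setup on cluster B
    ;;
  *)
    source env/laptop.sh               # local development machine
    ;;
esac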

Containers are particularly helpful here, because they allow you to carry most of your environment from one cluster to another. However, since not all HPC systems support the same container runtimes or host libraries, you should still provide alternative, module based setup scripts when possible. At minimum, document equivalent setups for the main systems you target.

An important concept is that of “scientific equivalence” rather than bitwise identity. Due to differences in hardware, operating systems, or compilers, two systems may not produce identical binary outputs. Instead, you aim for results that are numerically consistent within expected tolerances and that lead to the same scientific conclusions. For reproducibility you therefore need both technical environment control and domain appropriate validation tests to check that ported workflows behave acceptably.

Best Practices for Sustainable Reproducibility

Reproducibility is easiest if you treat it as part of your regular development and execution habits rather than as a separate, occasional task. Small, consistent practices accumulate into robust, sustainable workflows.

One cornerstone is the combined use of version control for code and text based environment descriptions. Every important change to your computational environment should be reflected in the repository, whether as updated module scripts, requirement files, or container recipes. Tagging code releases that correspond to published results, and preserving the associated environment specifications, makes later verification straightforward.

Another practice is regular verification. When the cluster environment changes, for example after a scheduled upgrade of compilers or libraries, rerun a small, representative test case and compare results to established references. If differences arise, investigate whether they are benign or indicate a change in numerical behavior. Keep such test cases small enough that they can be run frequently.
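
A lightweight verification step might look like the sketch below, which assumes a small test case, a stored reference output consisting of one number per line, and a tolerance chosen for your application; in practice the comparison is often done by a dedicated test harness.

#!/bin/bash
# verify_testcase.sh - rerun a small reference case after environment changes.
# The executable, input, reference file, and tolerance are placeholders.

./mysim test/small_case.nml > test/current_output.txt

# Compare the first column against the stored reference,
# allowing a relative difference of up to 1e-10 per value.
if awk 'NR == FNR { ref[FNR] = $1; next }
        {
          d = $1 - ref[FNR]; if (d < 0) d = -d
          s = ref[FNR];      if (s < 0) s = -s
          if (s == 0) s = 1
          if (d / s > 1e-10) bad = 1
        }
        END { exit bad }' test/reference_output.txt test/current_output.txt
then
    echo "Verification passed: results match the reference within tolerance."
else
    echo "Verification FAILED: investigate before trusting new production runs."
fi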

Finally, treat documentation as an integral part of your workflow. A plain text "how to run on cluster X" section in the project's README, describing the environment setup and run procedure, is more valuable than elaborate, rarely updated documentation kept elsewhere. As personnel or institutions change, these concise, accurate notes are often what allow future users to reproduce and extend your work.

Reproducible HPC workflows rely on:

  • Scripted, versioned environment setup.
  • Recorded run metadata and code versions.
  • Periodic verification after environment changes.

By combining environment control tools, clear recording of configurations, and disciplined workflow practices, you can make your HPC work not only reproducible for others, but also tractable and trustworthy for yourself over the long term.
