Why Reproducibility Matters in HPC
In HPC, “it runs on my machine” is not enough. You typically:
- Develop on a laptop or workstation.
- Run on a shared cluster with many users.
- Depend on complex software stacks and fast‑moving libraries.
- Run long, expensive jobs that you may need to re‑run months or years later.
Reproducibility means someone else (or you in the future) can:
- Rebuild the same software environment.
- Re‑run the same workflow.
- Obtain the same (or explainably similar) results.
In HPC, reproducibility has several dimensions:
- Software reproducibility – same compilers, libraries, tools, and versions.
- Environment reproducibility – same modules, environment variables, paths.
- Workflow reproducibility – same scripts, parameters, and data transformations.
- Hardware/cluster context – same or comparable architecture and scheduler settings.
Perfect bit‑for‑bit reproducibility is not always realistic (e.g., due to floating‑point order of operations, different MPI layouts), but you should aim for:
- Practical reproducibility – same inputs and environment yield results within expected numerical tolerance, and the process is clearly documented and repeatable.
Components of a Reproducible HPC Setup
Reproducibility in HPC is built from several layers. Other chapters go into technical detail; here we focus on how these pieces contribute to reproducible work.
Software and Library Versions
Your results can depend subtly on:
- Compiler version and flags.
- MPI implementation and version.
- Math and I/O libraries (BLAS, LAPACK, FFT, HDF5, NetCDF, etc.).
- Python/R/Julia/… interpreter versions and packages.
To be reproducible, you need to:
- Know exactly which versions you used.
- Be able to reconstruct those versions later.
Key practices:
- Prefer module‑provided software versions on the cluster rather than ad‑hoc installations in your home directory.
- Avoid “floating” dependencies (e.g., `pip install package` with no version pin); instead, specify versions (e.g., `package==1.3.2`), as in the sketch after this list.
- For compiled codes, record:
  - Compiler and version (e.g., `gcc 12.2.0`, `Intel oneAPI 2024`).
  - MPI implementation (e.g., `OpenMPI 4.1.5`, `Intel MPI 2021.10`).
  - Relevant math libraries (e.g., `MKL 2024.1`, `OpenBLAS 0.3.24`).
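A minimal sketch of version pinning for a Python-based part of a workflow; the package names and versions below are placeholders, not recommendations:

```bash
# Create an isolated environment and pin exact versions (illustrative packages/versions)
python -m venv "$HOME/envs/myproject"
source "$HOME/envs/myproject/bin/activate"

pip install numpy==1.26.4 h5py==3.10.0   # pin versions explicitly rather than taking "latest"
pip freeze > requirements.txt            # record the full, resolved set of versions

# Later, or on another system, rebuild the same environment:
pip install -r requirements.txt
```

On a cluster, load a specific Python module first so that the interpreter version is pinned as well.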
Environment Configuration
Even with the same software versions, differences in environment can change behavior:
- Environment variables (e.g., `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `LD_LIBRARY_PATH`, `PATH`).
- Module load order and conflicts.
- Threading/affinity settings, GPU visibility, etc.
For reproducibility:
- Avoid relying on “whatever the login node gives you by default”.
- Always set your environment explicitly in job scripts and setup scripts.
- Try to keep your environment minimal and controlled (see the sketch after this list):
  - Load only the modules you actually need.
  - Unload or purge default modules if they interfere.
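A minimal sketch of an explicit environment block for a setup or job script; the module names, versions, and variable values are illustrative only:

```bash
# Start from a clean, known state rather than whatever the login node provides
module purge
module load gcc/12.2.0 openmpi/4.1.5

# Set threading-related variables explicitly instead of inheriting them
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=close   # example affinity setting; tune for your application
unset LD_PRELOAD             # drop inherited settings you did not ask for
```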
Data, Inputs, and Configuration
Reproducible results also require:
- The same input data (or a stable, versioned source of it).
- The same configuration (parameters, flags, input files).
Good habits:
- Keep input decks, configuration files, and run scripts under version control.
- Treat input data like code: version it, or at least record checksums and locations (a sketch follows this list).
- Avoid manually editing large input files on the fly without tracking changes.
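For example, checksums can be recorded once and verified before later runs; the file patterns below are illustrative:

```bash
# Record checksums of the input files you actually use (adjust the pattern)
sha256sum inputs/*.h5 > inputs/SHA256SUMS

# Before a run, or when revisiting old results, verify the inputs are unchanged
sha256sum -c inputs/SHA256SUMS
```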
Workflow and Orchestration
HPC workflows often consist of multiple steps:
- Preprocess or generate input data.
- Run a simulation or analysis.
- Postprocess results.
- Visualize or summarize.
Each step should be scriptable and documented. Avoid:
- Manual, undocumented interactive steps.
- Copy‑pasting commands from memory into the shell.
Instead:
- Use simple scripts (`bash`, Python, etc.) to chain steps, as sketched after this list.
- Make scripts idempotent where possible (re-running them produces the same results, or clearly defined updates).
- Parameterize scripts via config files or command‑line arguments, rather than hard‑coding settings.
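A minimal sketch of such a driver script; the step scripts and config file name are placeholders for whatever your workflow actually uses:

```bash
#!/bin/bash
# run_workflow.sh -- chain the workflow steps from a single, parameterized entry point
set -euo pipefail

CONFIG=${1:-config.yaml}      # all parameters come from the config file, not from edits here

./preprocess.sh   "$CONFIG"   # 1. prepare or generate input data
./run_solver.sh   "$CONFIG"   # 2. run the simulation or analysis
./postprocess.sh  "$CONFIG"   # 3. reduce and summarize raw outputs
./make_figures.sh "$CONFIG"   # 4. produce plots and tables
```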
Strategies for Practical Reproducibility on HPC Systems
Use the Cluster’s Environment Modules Consistently
Environment modules (covered in detail elsewhere) are central to software reproducibility on many HPC systems.
For reproducible workflows:
- Record exactly which modules you load.
- Automate module loading in scripts instead of doing it manually.
Example pattern in a job script:
```bash
#!/bin/bash
#SBATCH --job-name=my_sim
#SBATCH --ntasks=64
#SBATCH --time=02:00:00

module purge
module load gcc/12.2.0
module load openmpi/4.1.5
module load mkl/2024.1
module list   # useful for logging

export OMP_NUM_THREADS=1

srun ./my_simulation input.in
```

Key ideas:
- `module purge` clears whatever modules happen to be loaded by default.
- A fixed, ordered list of `module load` commands defines your environment.
- `module list` in the job output creates an automatic record of the environment used for that run.
Version Control for Code and Configuration
Reproducibility requires knowing:
- Exactly what code you ran.
- What changes were made over time.
Using a version control system like git helps:
- Track changes in source code, scripts, and configuration files.
- Tag or branch specific states of the code associated with key results.
Typical workflow:
- Commit your code and scripts regularly.
- Use descriptive commit messages.
- Tag important states, e.g., `v1.0-paper-results` (a tagging sketch appears below).
- In your job scripts, log the current commit:

```bash
echo "Git commit: $(git rev-parse HEAD)" >> run_metadata.txt
```

If your code is too large or includes generated files, at minimum:
- Keep all human‑written source and configuration under version control.
- Exclude build artifacts and large raw data.
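A sketch of the commit-and-tag workflow described above; the paths, tag name, message, remote, and branch are examples:

```bash
# Commit the exact code and configuration used, then mark that state with a tag
git add src/ scripts/ config/
git commit -m "Freeze configuration for the resolution study"
git tag -a v1.0-paper-results -m "Code state used for the paper figures"
git push origin main --tags   # assumes a remote named 'origin' and a branch named 'main'
```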
Capturing the Execution Environment
To make a run reproducible, you need to capture:
- What you ran.
- How you ran it.
- Where it ran.
Practical techniques:
- Save job scripts (`.slurm`, `.pbs`, etc.) with meaningful names and dates.
- Log metadata to a simple text file alongside your outputs:
  - Date/time.
  - Cluster name and partition/queue.
  - Node list if relevant.
  - Job ID.
  - Module list.
  - Git commit.
  - Input files and key parameters.
Example snippet inside a job script:
```bash
OUTDIR=results/$(date +%Y%m%d_%H%M%S)
mkdir -p "$OUTDIR"

{
  echo "Job ID: $SLURM_JOB_ID"
  echo "Date: $(date)"
  echo "Host(s): $SLURM_NODELIST"
  echo "Git commit: $(git rev-parse HEAD 2>/dev/null || echo 'unknown')"
  echo "Modules:"
  module list 2>&1
  echo "Command: srun ./my_simulation input.in"
} > "$OUTDIR/run_metadata.txt"

srun ./my_simulation input.in > "$OUTDIR/output.log"
```

This makes it much easier to reconstruct or debug runs later.
Parameter and Experiment Management
For exploratory work, you may run many jobs with different parameters. To keep this reproducible:
- Centralize parameters in a single config file rather than editing multiple scripts.
- Give runs unique but structured names (e.g., including parameter values or labels).
- Keep a simple experiment log (spreadsheet, markdown, or text file) mapping:
  - Experiment ID → parameters, code version, input data, job ID, outputs.
Example config file (config.yaml):
```yaml
simulation:
  nx: 256
  ny: 256
  timesteps: 1000
  dt: 0.01
physics:
  viscosity: 1e-3
  forcing: true
```
Your driver script reads `config.yaml` and writes its contents into `run_metadata.txt` for every job; a sketch of this step follows.
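A minimal sketch of that step inside a job or driver script (directory layout and file names are illustrative):

```bash
CONFIG=config.yaml
OUTDIR=results/$(date +%Y%m%d_%H%M%S)
mkdir -p "$OUTDIR"

cp "$CONFIG" "$OUTDIR/"           # keep the exact config next to the outputs
{
  echo "Config file: $CONFIG"
  cat "$CONFIG"                   # embed the parameters in the run metadata
} >> "$OUTDIR/run_metadata.txt"
```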
Numerical and Parallel Considerations
Even with the same software environment, parallel runs may show small differences due to:
- Non‑associativity of floating‑point arithmetic.
- Different reduction orders across MPI ranks or threads.
- Slight differences in library implementations.
Strategies:
- For validation, compare results with tolerances:
  - For scalar outputs, use relative error:
    $$ \text{rel\_error} = \frac{|x_{\text{new}} - x_{\text{ref}}|}{|x_{\text{ref}}|} $$
  - For fields, use norms (e.g., the $L_2$ norm) and ensure they match within acceptable bounds (a sample criterion follows this list).
- Where possible, use deterministic algorithms or fixed random seeds.
- Avoid unnecessary sources of non‑determinism (e.g., uninitialized variables, race conditions).
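For example, a field comparison might be accepted when

$$ \frac{\lVert u_{\text{new}} - u_{\text{ref}} \rVert_2}{\lVert u_{\text{ref}} \rVert_2} \le \tau, $$

where $\tau$ is a tolerance chosen for your application; its value is problem-dependent and is not prescribed here.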
The goal is scientific reproducibility: results are numerically consistent enough that conclusions don’t change.
Documentation and Reporting for Reproducibility
Minimum Information to Record for Each Study
For any significant set of results (a paper figure, report, or milestone), you should be able to state:
- Code:
  - Repository URL (if applicable).
  - Commit hash or release tag.
  - Local modifications, if any.
- Environment:
  - Cluster name and OS (as far as exposed to users).
  - Compiler and MPI versions.
  - Key libraries (math, I/O, domain-specific).
  - Modules loaded.
- Run settings:
  - Number of nodes, cores/GPUs per node.
  - Threading parameters (e.g., `OMP_NUM_THREADS`).
  - Walltime, partition/queue.
- Inputs:
  - Data source and version.
  - Input file names or configs.
  - Any preprocessing steps.
- Outputs:
  - Where results are stored.
  - Description of postprocessing and plotting scripts.
Even a plain text file or README.md in your results directory with this information significantly improves reproducibility.
Automating Documentation
Manual documentation is error‑prone. You can:
- Add a standard “metadata block” to all job scripts (like in the earlier example).
- Have your application print its build information on startup:
  - Compile-time flags.
  - Linked library versions (where available).
- Write small helper scripts to (a sketch follows the example below):
  - Copy the job script itself into the output directory.
  - Dump the environment (`env`) into a file.
  - Save the config file used.
Example snippet to capture environment:
```bash
env | sort > "$OUTDIR/environment.txt"
```

This systematically preserves important context.
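A sketch of a helper block along these lines inside a job script; it assumes `$OUTDIR` and `$CONFIG` have already been set, and that the batch system exposes the script path via `$0`:

```bash
# Preserve the exact job script and config next to the results
cp "$0" "$OUTDIR/job_script.sh"   # under Slurm, $0 typically points at the spooled copy of the batch script
cp "$CONFIG" "$OUTDIR/"           # the configuration actually used for this run
```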
Reproducibility Across Systems
You might develop on one system and run on another, or move between clusters as resources change.
Challenges:
- Different module names and versions.
- Different default compilers or MPI.
- Different file systems and paths.
To remain reproducible across systems:
- Keep system‑specific settings isolated in small wrapper scripts (e.g., `setup_env_clusterA.sh`, `setup_env_clusterB.sh`) that expose the same interface; a sketch follows this list.
- Avoid hard‑coded absolute paths; use environment variables or config files instead.
- Use containers or well‑specified software stacks (covered in later subsections) when moving between significantly different environments.
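For instance, `setup_env_clusterA.sh` might look like the sketch below, with a sibling `setup_env_clusterB.sh` defining the same variables from that system's own modules and paths; all names, versions, and paths are illustrative:

```bash
# setup_env_clusterA.sh -- system-specific details live here; the interface stays the same
module purge
module load gcc/12.2.0 openmpi/4.1.5

export PROJECT_SCRATCH="/scratch/$USER/myproject"   # path differs per cluster, variable name does not
export OMP_NUM_THREADS=1
```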
A reasonable goal is environment portability: you can describe your software environment generically enough that an HPC support team, or a future you, can recreate it on a new system.
Practical Workflow Example
A simple, reproducible HPC workflow might look like this:
- Code and configuration under version control
  - Your solver code, job scripts, and config files are in a `git` repository.
- Environment setup script
  - `setup_env.sh` contains:

    ```bash
    module purge
    module load gcc/12.2.0
    module load openmpi/4.1.5
    module load mkl/2024.1
    ```

  - All job scripts `source setup_env.sh`.
- Standard job script template
  - Every job script:
    - Creates a timestamped output directory.
    - Copies the config file and job script into it.
    - Logs modules, git commit, environment, and parameters.
- Experiment log
  - An `experiments.csv` file in the repo where each line includes (an example line is shown after this list):
    - Experiment ID.
    - Config file used.
    - Git commit.
    - Job ID.
    - Output directory.
    - Brief notes.
- Postprocessing scripts
  - Separate scripts generate figures and tables from raw outputs.
  - These scripts are also version‑controlled and logged.
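A single line in such a file might look like this; all values are placeholders purely to illustrate the format:

```
exp_id,config,git_commit,job_id,output_dir,notes
E042,config_visc1e-3.yaml,a1b2c3d,1234567,results/20240501_120301,baseline viscosity run
```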
This level of structure, while simple, is usually enough to reproduce and audit your own work months later and to share clear instructions with collaborators.
Mindset and Best Practices
Reproducibility is not an afterthought; it is part of how you design and run HPC work:
- Assume future‑you or a collaborator will need to repeat or extend what you are doing.
- Treat everything—code, environment, data, workflow—as something that needs to be described and scripted.
- Prefer transparent, text‑based configurations and scripts over manual GUI clicks or undocumented steps.
- Start small: even adding basic logging (`module list`, `git rev-parse HEAD`, `env`) to your jobs is a major step forward.
The later sections on environment modules, software stacks, and containers will introduce more advanced tools to help you formalize these ideas. Here, the key is to adopt the habit of thinking: “If I had to redo this on a clean system in 6 months, what would I need to know and have scripted today?”