
Reproducibility and Software Environments

Why Reproducibility Matters in HPC

In HPC, “it runs on my machine” is not enough. You typically share systems with many other users, run on hardware you did not configure yourself, and depend on centrally installed compilers, MPI libraries, and numerical software.

Reproducibility means someone else (or you in the future) can rerun your computation from the same code, environment, and inputs and obtain consistent results.

In HPC, reproducibility has several dimensions: the software and library versions you build against, the environment configuration at run time, the data and parameters you feed in, and the workflow that ties the individual steps together.

Perfect bit‑for‑bit reproducibility is not always realistic (e.g., due to floating‑point order of operations, different MPI layouts), but you should aim for results that are numerically consistent enough that your scientific conclusions do not change.

Components of a Reproducible HPC Setup

Reproducibility in HPC is built from several layers. Other chapters go into technical detail; here we focus on how these pieces contribute to reproducible work.

Software and Library Versions

Your results can depend subtly on compiler versions and flags, the MPI implementation, and numerical libraries such as BLAS/LAPACK or MKL.

To be reproducible, you need to know, and be able to restate, exactly which versions were used for a given run.

Key practices: load software with explicit versions rather than defaults, record the versions used alongside each run, and keep the commands that set up your environment in a script, as in the sketch below.
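
As a minimal sketch, a job script can record the versions of the key tools it uses (gcc and mpirun here are examples; substitute whatever your workflow depends on):

# Record tool versions alongside the run; the tools shown are examples.
{
  echo "Compiler: $(gcc --version | head -n 1)"
  echo "MPI:      $(mpirun --version 2>&1 | head -n 1)"
} >> run_metadata.txt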

Environment Configuration

Even with the same software versions, differences in environment can change behavior: environment variables such as OMP_NUM_THREADS, locale settings, modules inherited from your login shell, and options picked up from shell startup files can all alter how a program runs.

For reproducibility: start each job from a clean environment, set the variables that matter explicitly in the job script, and avoid hiding configuration in ~/.bashrc or ~/.profile. A small sketch of this pattern follows.
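
A minimal sketch, assuming a batch job script (the variable values are examples only):

module purge                  # drop modules inherited from the submitting shell
export OMP_NUM_THREADS=1      # make the threading level explicit
export LC_ALL=C               # pin locale-dependent formatting and sorting
env | sort > environment_at_launch.txt   # record what the job actually saw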

Data, Inputs, and Configuration

Reproducible results also require the exact input data, parameter files, and configuration used for each run, including any random seeds.

Good habits: keep small inputs and configuration under version control, record checksums for large data sets, never edit inputs in place, and store a copy of the configuration next to each run’s output. A checksum sketch appears below.
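
As a small sketch (the file names are placeholders), checksums let you verify later that an input file is the one a run actually used:

# Record checksums of the inputs so the data can be verified later.
sha256sum input.in config.yaml >> run_metadata.txt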

Workflow and Orchestration

HPC workflows often consist of multiple steps:

  1. Preprocess or generate input data.
  2. Run a simulation or analysis.
  3. Postprocess results.
  4. Visualize or summarize.

Each step should be scriptable and documented. Avoid manual, undocumented actions: editing files by hand between stages, running one‑off commands interactively, or relying on steps that exist only in your memory.

Instead, script each step, chain the steps in a driver script or workflow tool, and keep those scripts under version control, as in the sketch below.
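
A minimal driver‑script sketch, assuming hypothetical per‑step scripts (preprocess.sh, run_sim.sh, postprocess.sh, make_figures.sh) that you would write for your own workflow:

#!/bin/bash
# Minimal workflow driver: each stage is its own version-controlled script.
set -euo pipefail             # stop immediately if any step fails
./preprocess.sh config.yaml   # 1. prepare or generate input data
sbatch --wait run_sim.sh      # 2. run the simulation as a batch job
./postprocess.sh results/     # 3. reduce and analyze raw output
./make_figures.sh results/    # 4. visualize or summarize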

Strategies for Practical Reproducibility on HPC Systems

Use the Cluster’s Environment Modules Consistently

Environment modules (covered in detail elsewhere) are central to software reproducibility on many HPC systems.

For reproducible workflows: always purge inherited modules first, load modules with explicit versions, and log the loaded modules with every run.

Example pattern in a job script:

#!/bin/bash
#SBATCH --job-name=my_sim
#SBATCH --ntasks=64
#SBATCH --time=02:00:00
module purge                  # start from a clean module environment
module load gcc/12.2.0        # pin exact versions rather than defaults
module load openmpi/4.1.5
module load mkl/2024.1
module list  # useful for logging
export OMP_NUM_THREADS=1      # make threading explicit
srun ./my_simulation input.in

Key ideas: module purge prevents modules inherited from your login shell from leaking into the job; explicit versions ensure the same stack is loaded every time; and module list leaves a record of what actually ran.

Version Control for Code and Configuration

Reproducibility requires knowing exactly which version of your code, job scripts, and configuration produced a given result.

Using a version control system like git helps: every change is tracked, past states can be recovered, and a single commit hash identifies the precise code used.

Typical workflow: commit your changes before submitting a job, then record the commit hash in the run metadata:

echo "Git commit: $(git rev-parse HEAD)" >> run_metadata.txt

If your code is too large or includes generated files, at minimum keep the source, job scripts, and configuration under version control and record the commit hash of the main repository with every run.
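
As an optional sketch, you can refuse to submit with uncommitted changes, so the recorded hash really describes the code that ran (the check and message are illustrative):

# Refuse to submit if the working tree has uncommitted changes,
# so the logged commit hash matches the code that actually runs.
if [ -n "$(git status --porcelain)" ]; then
  echo "Uncommitted changes; commit before submitting." >&2
  exit 1
fi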

Capturing the Execution Environment

To make a run reproducible, you need to capture when and where the job ran, which code version and modules were used, and exactly what command was executed.

Practical techniques: write a small metadata file into every output directory, and let the job script create it automatically so nothing is forgotten.

Example snippet inside a job script:

OUTDIR=results/$(date +%Y%m%d_%H%M%S)   # one timestamped directory per run
mkdir -p "$OUTDIR"
{
  echo "Job ID: $SLURM_JOB_ID"
  echo "Date: $(date)"
  echo "Host(s): $SLURM_NODELIST"
  echo "Git commit: $(git rev-parse HEAD 2>/dev/null || echo 'unknown')"
  echo "Modules:"
  module list 2>&1
  echo "Command: srun ./my_simulation input.in"
} > "$OUTDIR/run_metadata.txt"
srun ./my_simulation input.in > "$OUTDIR/output.log"

This makes it much easier to reconstruct or debug runs later.

Parameter and Experiment Management

For exploratory work, you may run many jobs with different parameters. To keep this reproducible: keep parameters in configuration files rather than hard‑coded in scripts, give each experiment an identifier, and log which configuration produced which output.

Example config file (config.yaml):

simulation:
  nx: 256
  ny: 256
  timesteps: 1000
  dt: 0.01
physics:
  viscosity: 1e-3
  forcing: true

Your driver script reads config.yaml and writes its contents into run_metadata.txt for every job.
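
A one‑line sketch of the logging half of that idea (the file and directory names follow the examples above):

# Preserve the exact configuration used by this job.
cat config.yaml >> "$OUTDIR/run_metadata.txt"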

Numerical and Parallel Considerations

Even with the same software environment, parallel runs may show small differences due to the non‑associativity of floating‑point arithmetic: changing the number of processes or threads changes the order of reductions and summations, and with it the last few digits of the result.

Strategies: use the same process and thread counts when comparing runs, prefer deterministic reduction options where your libraries offer them, and compare results against a tolerance rather than bit‑for‑bit, as in the sketch below.
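
A minimal tolerance‑check sketch using awk (the file names and the tolerance are placeholders; ref.txt and new.txt each hold a single number):

# Succeed if new.txt matches ref.txt within a relative tolerance of 1e-10.
awk 'NR==FNR { ref = $1; next }
     {
       diff = $1 - ref;  if (diff < 0) diff = -diff
       scale = ref;      if (scale < 0) scale = -scale
       rel = (scale == 0) ? diff : diff / scale
       exit !(rel <= 1e-10)
     }' ref.txt new.txt && echo "within tolerance" || echo "differs"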

The goal is scientific reproducibility: results are numerically consistent enough that conclusions don’t change.

Documentation and Reporting for Reproducibility

Minimum Information to Record for Each Study

For any significant set of results (a paper figure, report, or milestone), you should be able to state which code version (commit or tag) was used, which software environment (modules and versions) was loaded, which inputs and parameters were fed in, which jobs produced the outputs, and where those outputs live.

Even a plain text file or README.md in your results directory with this information significantly improves reproducibility.
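
As a sketch, the job script can generate such a file automatically (the field values are placeholders that mirror the list above):

# Write a minimal study README with the key reproducibility facts.
cat > "$OUTDIR/README.md" <<EOF
Code:        commit $(git rev-parse --short HEAD 2>/dev/null || echo unknown)
Environment: see run_metadata.txt (module list output)
Inputs:      config.yaml (copied into this directory)
Job:         $SLURM_JOB_ID
Outputs:     $OUTDIR
EOF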

Automating Documentation

Manual documentation is error‑prone. You can instead have your job scripts write the documentation automatically: every run then records its own environment, modules, and parameters without relying on memory.

Example snippet to capture environment:

env | sort > "$OUTDIR/environment.txt"

This systematically preserves important context.

Reproducibility Across Systems

You might develop on one system and run on another, or move between clusters as resources change.

Challenges: module names and available versions differ between clusters, compilers and MPI implementations vary, and file systems and schedulers follow different conventions.

To remain reproducible across systems: record your software requirements generically (for example, “GCC 12.x, OpenMPI 4.1.x, MKL”) rather than as system‑specific module names, isolate system‑specific setup in per‑system scripts, and consider containers where the clusters support them. A per‑system setup sketch appears below.

A reasonable goal is environment portability: you can describe your software environment generically enough that an HPC support team, or a future you, can recreate it on a new system.
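
A minimal sketch of a per‑system setup script (the cluster names and module versions are hypothetical; adapt them to the systems you actually use):

# setup_env.sh: source this file (". setup_env.sh") to load an
# equivalent software stack on whichever known system you are on.
case "$(hostname -f)" in
  *.clusterA.example.org)   # hypothetical cluster A
    module purge
    module load gcc/12.2.0 openmpi/4.1.5 mkl/2024.1
    ;;
  *.clusterB.example.org)   # hypothetical cluster B
    module purge
    module load gcc/12.3.0 openmpi/4.1.6 mkl/2024.1
    ;;
  *)
    echo "setup_env.sh: unknown system, no module stack defined" >&2
    ;;
esac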

Practical Workflow Example

A simple, reproducible HPC workflow might look like this:

  1. Code and configuration under version control
    • Your solver code, job scripts, and config files are in a git repository.
  2. Environment setup script
    • setup_env.sh contains:
     module purge
     module load gcc/12.2.0
     module load openmpi/4.1.5
     module load mkl/2024.1
  3. Standard job script template
    • Every job script:
      • Creates a timestamped output directory.
      • Copies the config file and job script into it.
      • Logs modules, git commit, environment, and parameters.
  4. Experiment log
    • An experiments.csv file in the repo where each line includes:
      • Experiment ID.
      • Config file used.
      • Git commit.
      • Job ID.
      • Output directory.
      • Brief notes.
  5. Postprocessing scripts
    • Separate scripts generate figures and tables from raw outputs.
    • These scripts are also version‑controlled and logged.

This level of structure, while simple, is usually enough to reproduce and audit your own work months later and to share clear instructions with collaborators.

Mindset and Best Practices

Reproducibility is not an afterthought; it is part of how you design and run HPC work: script everything, record everything, and assume that you, or someone else, will need to redo it later.

The later sections on environment modules, software stacks, and containers will introduce more advanced tools to help you formalize these ideas. Here, the key is to adopt the habit of thinking: “If I had to redo this on a clean system in 6 months, what would I need to know and have scripted today?”
