Why Reproducibility Matters in HPC Workflows
In HPC, “reproducible” does not only mean “same numbers on the same machine.” It typically means:
- Another person
- At a later time
- On a different system
can rerun your work and:
- Obtain the same results (within numerical tolerance),
- Using clearly documented steps,
- With clearly specified software and data.
HPC adds extra challenges: modules, queues, schedulers, parallelism, and evolving software stacks. Best practices aim to make your work independent of your memory, your personal environment, and your current cluster.
General Principles for Reproducible Workflows
1. Treat your workflow as code
Anything you do manually and repeatedly should be:
- Written down as commands in a script.
- Version-controlled like source code.
Typical artifacts:
- Shell scripts for running jobs, e.g. run_simulation.sh, submit_all.sh.
- Job submission scripts for the scheduler.
- Small “driver” programs or Python scripts that orchestrate steps.
Rules of thumb:
- Never rely on “I just typed these commands in the terminal” for important results.
- Avoid manual GUI steps (unless absolutely required); if you must use them, document them step by step.
2. Prefer scripts over ad-hoc commands
Instead of:
- Typing a long pipeline on the shell and then forgetting it,
- Manually editing multiple job scripts each time,
use:
- A single master script that calls others,
- Parameterized scripts (using environment variables or arguments),
- Simple workflow tools (e.g. Make, Python scripts, or workflow managers, if available on your system).
Example pattern:
./prepare_input.py config.yaml
sbatch run_case.slurm
./analyze_results.py output/
You should be able to rerun the whole workflow (or key parts) with a single command once everything is set up.
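For example, a minimal run_all.sh wrapper might chain the steps above into that single command (a sketch; the script names follow the example, and sbatch --wait blocks until the job finishes, which may not suit very long jobs):
#!/usr/bin/env bash
# Sketch: a single entry point that reruns the whole workflow.
set -euo pipefail

./prepare_input.py config.yaml
sbatch --wait run_case.slurm
./analyze_results.py output/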
3. Make runs parameter-driven, not hard-coded
Replace values that change between runs (problem size, input files, physics parameters, number of processes, etc.) with:
- Command-line arguments, or
- Clearly separated configuration files (e.g. .ini, .yaml, .json), or
- Environment variables.
Example:
./run_experiment.sh problem_size=1024 method=GMRES
This avoids copying and editing scripts for each variant, which quickly becomes unmanageable and error-prone.
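A minimal sketch of what such a run_experiment.sh could look like; ./solver and its flags are placeholders for your actual program:
#!/usr/bin/env bash
# Sketch: accept key=value arguments and pass them on to the program.
# Defaults apply when a key is not given on the command line.
set -euo pipefail

problem_size=512          # default problem size
method=CG                 # default solver method

# Parse key=value pairs, e.g. ./run_experiment.sh problem_size=1024 method=GMRES
for arg in "$@"; do
    case "$arg" in
        problem_size=*) problem_size="${arg#problem_size=}" ;;
        method=*)       method="${arg#method=}" ;;
        *) echo "Unknown argument: $arg" >&2; exit 1 ;;
    esac
done

echo "Running with problem_size=$problem_size method=$method"
./solver --size "$problem_size" --method "$method"   # placeholder executable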
4. Make randomness reproducible
For codes that use randomness (Monte Carlo, stochastic optimization, randomized algorithms):
- Expose a random seed as a parameter.
- Record the seed for each run.
- Avoid using default, implicit seeds if you need strict reproducibility.
If full bitwise reproducibility is not possible due to parallelism or hardware differences, document expected numerical variability and tolerance.
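For example (a sketch; the --seed flag on ./monte_carlo is a hypothetical interface, adapt it to your code):
#!/usr/bin/env bash
# Sketch: take or default a seed, record it with the run, then pass it to the program.
set -euo pipefail

seed="${1:-12345}"                       # seed from the command line, default 12345
echo "random_seed: $seed" >> metadata.txt
./monte_carlo --seed "$seed" --config config.yaml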
Capturing the Software Environment
(Parent chapters cover modules, stacks, and containers. Here we focus on using them for reproducibility.)
5. Export and record your environment
Before or during runs, record:
- Loaded modules
- Software versions
- Compiler versions
- Important environment variables
Example commands (to put into job scripts):
echo "Date: $(date)"
echo "Hostname: $(hostname)"
echo "Working directory: $(pwd)"
echo "Git commit: $(git rev-parse HEAD 2>/dev/null || echo 'not a git repo')"
echo "Loaded modules:"
module list 2>&1
echo "Environment snapshot:"
env | sort
Redirect this to a log file that you archive with the results, e.g. run_001_env.log.
6. Pin versions; avoid “floating” dependencies
Avoid unqualified “latest” versions:
- Use exact version numbers in module loads, e.g. module load gcc/12.1, not module load gcc.
- In package managers (Python, R, etc.), specify versions explicitly: numpy==1.23.5, not just numpy.
Maintain:
- A requirements.txt, environment.yml, renv.lock, spack.yaml, etc. for language-specific packages.
- A short text file listing all key HPC modules and versions used, e.g. hpc_environment.txt.
This enables you (and others) to reconstruct the environment even when the default cluster environment changes.
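For example, a short snapshot might record both (a sketch, assuming an environment-modules system and a Python environment; the file names follow the examples above):
module list 2>&1 | tee hpc_environment.txt    # loaded HPC modules and their versions
pip freeze > requirements.txt                 # exact Python package versions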
7. Prefer stable, curated software stacks
Where available:
- Use site-provided modules and toolchains that your HPC center maintains as stable configurations.
- Avoid compiling everything from scratch unless necessary; if you must, document:
- Exact source code versions (e.g. Git commit hashes),
- Build options and flags,
- Dependencies and their versions.
Store build scripts (or at least build command logs) in version control along with your code.
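A minimal sketch of such a build script, assuming a GCC-based MPI toolchain; the compiler, flags, and file names are illustrative:
#!/usr/bin/env bash
# Sketch: record build provenance alongside the build itself.
set -euo pipefail

{
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "git_commit: $(git rev-parse HEAD)"
    echo "compiler: $(mpicc --version | head -n 1)"
    echo "flags: -O3 -march=native"
} > build_info.txt

mpicc -O3 -march=native -o solver solver.c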
8. Use containers or environment snapshots when possible
If your center supports containers / environment managers:
- Create a container or environment file that captures your software stack.
- Build containers or environments from declarative recipes (e.g. definition files,
environment.yml) rather than manual interactive setup.
Benefits:
- Portability between clusters.
- Protection from module changes and OS updates.
Document:
- The container or environment name and version.
- The recipe file and how to rebuild it.
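For example (a sketch, assuming Apptainer/Singularity and conda are available on your system; my_stack.def and environment.yml are illustrative recipe names):
apptainer build my_stack.sif my_stack.def    # build a container image from a definition file
conda env create -f environment.yml          # recreate a conda environment from its recipe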
Capturing Inputs, Outputs, and Configuration
9. Treat input data as versioned, immutable artifacts
For input files:
- Keep a read-only “original” copy of the raw data.
- Record:
- File names,
- Checksums (e.g. sha256sum),
- Provenance: where the data came from and how it was generated.
If data is preprocessed:
- Script the preprocessing.
- Record both the preprocessing script and the original + processed data checksums.
- Avoid manual editing of input files, or document it explicitly if it cannot be avoided.
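For example (a sketch; the data paths are illustrative):
sha256sum data/raw/*.dat > checksums.sha256    # record once, when the raw data is archived
sha256sum --check checksums.sha256             # verify before a run; reports any mismatch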
10. Use explicit configuration files
Instead of many scattered flags and environment variables, use a configuration file that:
- Lists all important parameters (problem sizes, algorithmic options, thresholds, I/O options),
- Is stored with the run results,
- Can be checked into version control.
Format examples: .yaml, .json, .ini, plain text key=value.
Sample pattern:
config_base.yaml (general defaults), config_large.yaml (overrides for a larger problem), config_gpu.yaml (overrides for GPU runs).
11. Organize run directories consistently
Use a consistent directory structure for your experiments. For example:
project/
  code/
  configs/
  runs/
    run_001/
      config.yaml
      job.slurm
      stdout.log
      stderr.log
      env.log
      results/
        ...
    run_002/
      ...
Conventions that help:
- One directory per “logical” run or experiment.
- Store everything needed to understand that run within its directory (or with clear links).
- Use meaningful or systematically numbered run names (run_001, run_002, etc., or short descriptive names).
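A minimal sketch of a helper that sets up a new run directory in this layout; new_run.sh and the copied file names are illustrative:
#!/usr/bin/env bash
# Sketch: create a run directory, copy in its config and job script, and submit.
set -euo pipefail

run_id="$1"                                   # e.g. ./new_run.sh run_003
mkdir -p "runs/$run_id/results"

cp configs/config_large.yaml "runs/$run_id/config.yaml"
cp job_large.slurm           "runs/$run_id/job.slurm"

(cd "runs/$run_id" && sbatch job.slurm)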
12. Keep raw outputs; separate analysis and plotting
To support future reinterpretation:
- Save the raw numerical output (within reasonable size limits).
- Avoid overwriting raw data with processed data.
- Write analysis scripts that read raw outputs and generate:
- Summary tables,
- Figures,
- Logs.
This way, if you change analysis methods or fix a bug, you can rerun the analysis on the same raw data without redoing expensive simulations.
13. Log enough metadata for each run
For every run, capture at minimum:
- Code version (e.g. Git commit hash or version tag),
- Configuration file used,
- HPC job script,
- Number of nodes/cores, GPU usage,
- Timing information (start/end time, elapsed time),
- Memory or scaling info if relevant.
Store this in a simple text file, e.g. metadata.txt:
code_version: 8b32c1f
config_file: config_large.yaml
job_script: job_large.slurm
nodes: 4
tasks_per_node: 32
gpus_per_node: 1
submit_time: 2025-11-02T09:13:45Z
finish_time: 2025-11-02T12:47:21Z
Version Control Practices for Workflows
14. Use version control for more than just source code
Put under version control (e.g. git):
- Source code.
- Job scripts.
- Configuration files.
- Environment / container recipes.
- Analysis and plotting scripts.
Do not put:
- Large raw data files (use separate data storage; consider git-annex, DVC, or center-provided tools).
- Generated binary outputs (unless very small and essential).
15. Tie runs to code versions
Avoid the situation “I don’t know which version of the code produced this figure.”
Practices:
- Tag releases corresponding to major result sets, e.g. v1.0-paper-A.
- Include the Git commit hash in:
- Program output,
- Log file headers,
- Filenames if convenient, e.g. results_8b32c1f.h5.
Optional automation: Have your build system embed the Git hash into the executable at compile time.
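For example (a sketch; GIT_HASH is a hypothetical preprocessor macro that your program would print at startup, and the compiler invocation is illustrative):
GIT_HASH=$(git rev-parse --short HEAD)
mpicc -O3 -DGIT_HASH=\"$GIT_HASH\" -o solver solver.c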
16. Make changes traceable
When you change:
- Model assumptions,
- Numerical methods,
- Preprocessing steps,
record this in commit messages and/or a CHANGELOG.md or experiments.md. This gives future you (or collaborators) a concise narrative of what changed and why.
Scheduling and Job-Script Practices That Aid Reproducibility
(Details of job schedulers are covered elsewhere. Here we focus on conventions.)
17. Keep job scripts under version control and parameterized
Avoid manually editing a single job script for each run. Instead:
- Write generic templates with placeholders or variables for:
- Job name,
- Runtime,
- Node count,
- Account/project ID,
- Configuration file.
You can:
- Use environment variables or a simple wrapper script to fill in these parameters.
- Generate job scripts from a small driver script if needed.
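A minimal sketch of such a wrapper, assuming Slurm; job_template.slurm is a hypothetical template that reads $CONFIG to locate its configuration file:
#!/usr/bin/env bash
# Sketch: fill in job parameters at submission time instead of editing the script by hand.
set -euo pipefail

CONFIG="$1"               # e.g. configs/config_large.yaml
NODES="${2:-4}"
TIME="${3:-12:00:00}"

sbatch --job-name="exp_$(basename "$CONFIG" .yaml)" \
       --nodes="$NODES" \
       --time="$TIME" \
       --export=ALL,CONFIG="$CONFIG" \
       job_template.slurm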
18. Log scheduler and resource information
In your job script, record:
- Scheduler job ID ($SLURM_JOB_ID or equivalent),
- Allocated resources (nodes, tasks, memory),
- Node list ($SLURM_NODELIST),
- Queue/partition name.
Append this to your metadata/log files:
echo "SLURM_JOB_ID: $SLURM_JOB_ID" >> metadata.txt
echo "SLURM_NODELIST: $SLURM_NODELIST" >> metadata.txtThis helps diagnose performance differences and understand where the job actually ran.
19. Use deterministic resource specifications
Avoid relying on “whatever resources the scheduler gives me.” Instead:
- Specify resources explicitly (nodes, tasks per node, memory).
- Document why you chose specific resource sizes.
If you later change resource layout (e.g. from 16 to 32 tasks per node), treat it as a new experiment configuration and record it.
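On a Slurm system, explicit resource specifications might look like the following (the values are illustrative):
#SBATCH --nodes=4                 # fixed node count for this experiment series
#SBATCH --ntasks-per-node=32      # one MPI rank per core on this partition
#SBATCH --mem=64G                 # explicit memory request instead of the partition default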
Reproducible Post-Processing and Analysis
20. Make analysis steps fully scriptable
For post-processing (e.g. data reduction, statistics, plotting):
- Implement them as scripts that can be run non-interactively.
- Avoid manual, interactive editing in tools like spreadsheet software for final results.
Common patterns:
analyze_results.py raw_data_dir output_summary.csv
make_plots.py summary.csv figures/
Your final figures should be regenerable from the raw outputs using documented commands.
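For example, a one-command analysis driver (a sketch; the script and path names follow the patterns above and are illustrative):
#!/usr/bin/env bash
# Sketch: regenerate all analysis products from raw outputs with a single command.
set -euo pipefail

./analyze_results.py runs/run_001/results summary.csv
./make_plots.py summary.csv figures/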
21. Separate environment for analysis when needed
Sometimes:
- The compute environment (compilers, MPI, libraries) differs from the analysis environment (Python/R, visualization).
If so:
- Document the analysis environment separately (e.g. analysis_env.yml).
- Ensure analysis scripts are version-controlled and tied to this environment.
22. Use notebooks carefully
If you use Jupyter/R notebooks:
- Treat them as front-ends to scripts, not as the only place where logic lives.
- Keep core computations in importable modules or scripts.
- Run notebooks from start to finish before saving, to ensure they represent a consistent state.
- Save notebooks and the data they consume in a reproducible structure.
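For example, assuming Jupyter is installed, a notebook can be executed top to bottom non-interactively before it is saved or committed:
jupyter nbconvert --to notebook --execute --inplace analysis.ipynb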
Documentation and Communication
23. Maintain a “how to reproduce” document
Include a short, human-readable description of your workflow, e.g. REPRODUCE.md:
- Required software and modules.
- How to set up the environment (modules, containers, packages).
- Steps to:
- Prepare inputs,
- Run the workflow on the cluster,
- Run the analysis,
- Generate final figures.
Test this occasionally (or ask a collaborator) to confirm it still works.
24. Write down assumptions and non-obvious details
Document:
- Hardware assumptions (e.g. “requires GPUs with at least 16 GB memory”).
- Expected runtime and cost (“~4 nodes × 12 hours per run”).
- Numerical tolerances (“results reproducible within ±1e-6 relative error”).
- Known sources of variability (e.g. different processor counts may change rounding error).
This contextual information helps others interpret your results and their reproducibility limits.
25. Be explicit about what is and is not reproducible
Some aspects may be:
- Strictly reproducible (bitwise identical for given hardware and resources).
- Statistically reproducible (same distributions, small numerical differences).
- Only approximately reproducible due to hardware or implementation differences.
State clearly:
- What guarantee you can provide.
- The range of acceptable variation.
Lightweight Automation and Workflow Tools
26. Start simple; automate gradually
You do not need a complex workflow manager to be reproducible. A good baseline:
- Shell scripts + version control,
- Explicit config files,
- Organized run directories,
- Saved logs and metadata.
Once your workflow becomes more complex (many parameters, dependencies, or stages), consider lightweight tools like:
- make for simple dependency-based workflows.
- Small Python “driver” scripts that orchestrate tasks.
- Simple job array techniques for parameter sweeps.
The goal is to replace informal, manual procedures with documented, executable steps.
27. Validate reproducibility periodically
Choose a small test case and occasionally:
- Rebuild your code from scratch,
- Reconstruct the environment from your documented steps,
- Rerun the test,
- Compare results against a stored reference output.
If discrepancies appear, investigate and update documentation accordingly.
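A minimal sketch of such a check, assuming a small test case and a stored reference output; run_test.sh, test_case.yaml, and reference/expected.csv are illustrative names:
#!/usr/bin/env bash
# Sketch: rerun a small test case and compare against a stored reference output.
set -euo pipefail

./run_test.sh test_case.yaml output/test.csv

# A plain diff works for bitwise-reproducible cases; otherwise compare
# within your documented numerical tolerance instead.
if diff -q output/test.csv reference/expected.csv; then
    echo "Reproducibility check passed."
else
    echo "Reproducibility check FAILED: investigate before trusting new results." >&2
    exit 1
fi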
Collaboration and Sharing
28. Use shared repositories for collaborative projects
For multi-person projects:
- Use a shared repository (Git hosting, institutional service, etc.).
- Agree on:
- Branching and tagging conventions,
- How to store and reference large data,
- Where and how to document workflows.
This ensures all collaborators can reproduce and extend each other’s work.
29. Package artifacts for publication and archiving
When you publish or hand off work:
- Provide:
- Code (or a pointer to it),
- Configuration files,
- Environment/container recipes,
- Representative input and output data (or DOIs).
- Consider uploading:
- Archived datasets to a data repository,
- Code/version to a code repository with a release tag.
Aim for a package that a competent user familiar with HPC and the target system can use to reproduce your main results with reasonable effort.
Summary Checklist
A minimal reproducible HPC workflow should:
- Use scripts (not manual commands) for all major steps.
- Have all code, configs, and scripts under version control.
- Record exact software and hardware environments.
- Store inputs, configuration, outputs, and logs together per run.
- Capture random seeds and key parameters.
- Provide documented steps (e.g. REPRODUCE.md) that a third party can follow.
Starting with these practices, you can gradually refine and automate your workflows as your projects and systems grow more complex.