
Best practices for reproducible workflows

Why Reproducibility Matters in HPC Workflows

In HPC, “reproducible” does not only mean “same numbers on the same machine.” It typically means that you, or someone else, can rerun your work and obtain equivalent results: the same numbers where bitwise reproducibility is possible, or results within a documented tolerance where it is not.

HPC adds extra challenges: modules, queues, schedulers, parallelism, and evolving software stacks. Best practices aim to make your work independent of your memory, your personal environment, and your current cluster.

General Principles for Reproducible Workflows

1. Treat your workflow as code

Anything you do manually and repeatedly should be captured in a script and placed under version control.

Typical artifacts include job scripts, build scripts, pre- and post-processing scripts, and configuration files.

Rules of thumb: if you have typed the same sequence of commands more than once, turn it into a script; if a step affects the results, it belongs in the workflow, not only in your memory.

2. Prefer scripts over ad-hoc commands

Instead of typing ad-hoc command sequences interactively at the shell prompt, put each step into a script that you can rerun. An example pattern:

./prepare_input.py config.yaml
sbatch run_case.slurm
./analyze_results.py output/

You should be able to rerun the whole workflow (or key parts) with a single command once everything is set up.
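For example, a small driver script can tie the steps together (a sketch; run_all.sh is a placeholder name, and the other script names follow the pattern above):

#!/bin/bash
# run_all.sh - rerun the full workflow for one configuration (sketch)
set -euo pipefail

CONFIG=${1:-config.yaml}        # configuration file, default: config.yaml

./prepare_input.py "$CONFIG"    # generate inputs from the configuration
sbatch run_case.slurm           # submit the simulation to the scheduler
# The analysis step is run separately once the job has finished:
# ./analyze_results.py output/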

3. Make runs parameter-driven, not hard-coded

Replace values that change between runs (problem size, input files, physics parameters, number of processes, etc.) with command-line arguments, environment variables, or entries in a configuration file, rather than hard-coding them in scripts.

Example:

./run_experiment.sh problem_size=1024 method=GMRES

This avoids copying and editing scripts for each variant, which quickly becomes unmanageable and error-prone.
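One way to support such key=value arguments in a shell script is sketched below (the parameter names and defaults are illustrative):

#!/bin/bash
# run_experiment.sh - accept key=value overrides on the command line (sketch)
set -euo pipefail

# Defaults, overridable as: ./run_experiment.sh problem_size=1024 method=GMRES
problem_size=256
method=CG

for arg in "$@"; do
  key=${arg%%=*}
  value=${arg#*=}
  case "$key" in
    problem_size) problem_size=$value ;;
    method)       method=$value ;;
    *) echo "Unknown parameter: $key" >&2; exit 1 ;;
  esac
done

echo "Running with problem_size=$problem_size method=$method"
# ./solver --size "$problem_size" --method "$method"   # hypothetical solver call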

4. Make randomness reproducible

For codes that use randomness (Monte Carlo, stochastic optimization, randomized algorithms), make the random seed an explicit, recorded parameter rather than an implicit, time-based default.

If full bitwise reproducibility is not possible due to parallelism or hardware differences, document expected numerical variability and tolerance.
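A minimal pattern in a job script might look like this (a sketch; the --seed flag and solver name are placeholders for however your code accepts a seed):

# Use a fixed seed by default; allow overriding it per run, and always log it.
SEED=${SEED:-12345}
echo "random_seed: $SEED" >> metadata.txt
# ./monte_carlo --seed "$SEED" config.yaml   # hypothetical solver invocation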

Capturing the Software Environment

(Parent chapters cover modules, stacks, and containers. Here we focus on using them for reproducibility.)

5. Export and record your environment

Before or during runs, record the date, hostname, working directory, code version, loaded modules, and relevant environment variables.

Example commands (to put into job scripts):

echo "Date: $(date)"
echo "Hostname: $(hostname)"
echo "Working directory: $(pwd)"
echo "Git commit: $(git rev-parse HEAD 2>/dev/null || echo 'not a git repo')"
echo "Loaded modules:"
module list 2>&1
echo "Environment snapshot:"
env | sort

Redirect this to a log file that you archive with the results, e.g. run_001_env.log.

6. Pin versions; avoid “floating” dependencies

Avoid unqualified “latest” versions of compilers, modules, and libraries; load and install specific, fully qualified versions instead.

Maintain a record of the exact versions used, for example pinned module load commands and dependency files with exact version numbers.

This enables you (and others) to reconstruct the environment even when the default cluster environment changes.
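For example (a sketch; the module and package names and versions are placeholders, site-specific names will differ):

# In the job or build script: load fully qualified module versions.
module load gcc/12.3.0
module load openmpi/4.1.5
# For Python dependencies, pin exact versions in a requirements file:
# numpy==1.26.4
# scipy==1.11.4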

7. Prefer stable, curated software stacks

Where available, prefer the stable, curated software stacks maintained by your HPC center over ad-hoc personal builds; when you do build software yourself, do it from a script.

Store build scripts (or at least build command logs) in version control along with your code.

8. Use containers or environment snapshots when possible

If your center supports containers or environment managers, consider packaging your application and its dependencies into a container image or a reproducible environment definition.

The benefit is a self-contained software environment that behaves the same across login sessions, cluster upgrades, and, to a large extent, different systems.

Document how the image or environment was built and exactly which version of it was used for each run, as in the sketch below.
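For example, if your center provides Apptainer/Singularity (an assumption; the image and application names are placeholders):

# Record which image was used, then run the application inside it.
IMAGE=my_app_v1.2.sif
echo "container_image: $IMAGE" >> metadata.txt
apptainer exec "$IMAGE" ./my_app config.yaml
# On older installations the command may be "singularity exec" instead.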

Capturing Inputs, Outputs, and Configuration

9. Treat input data as versioned, immutable artifacts

For input files, treat them as read-only, give them stable names or version identifiers, and record checksums so you can later verify that a run used exactly the data you think it did.

If data is preprocessed, keep the preprocessing scripts and record which raw data and parameters produced each derived dataset.
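A simple way to make inputs verifiable is to store checksums with the run (a sketch; the paths are placeholders):

# Record checksums of all input files used by this run.
sha256sum input/*.dat > runs/run_001/input_checksums.txt
# Later: verify that the inputs are still bit-for-bit identical.
sha256sum -c runs/run_001/input_checksums.txt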

10. Use explicit configuration files

Instead of many scattered flags and environment variables, use a configuration file that collects all run-specific settings in one place and is archived together with the results.

Format examples: .yaml, .json, .ini, plain text key=value.

Sample pattern:
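For example, an illustrative config.yaml (the keys are placeholders matching the earlier examples):

problem_size: 1024
method: GMRES
input_file: input/mesh_large.dat
random_seed: 12345
tasks_per_node: 32

The run then takes this file as its single argument, e.g. ./prepare_input.py config.yaml.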

11. Organize run directories consistently

Use a consistent directory structure for your experiments. For example:

project/
  code/
  configs/
  runs/
    run_001/
      config.yaml
      job.slurm
      stdout.log
      stderr.log
      env.log
      results/
        ...
    run_002/
      ...
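A small helper script can create such directories consistently (a sketch; new_run.sh is a placeholder name and the layout follows the tree above):

#!/bin/bash
# new_run.sh - create the next numbered run directory and copy the config in (sketch)
set -euo pipefail

CONFIG=${1:?"usage: ./new_run.sh config.yaml"}
NEXT=$(printf "run_%03d" $(( $(ls -d runs/run_* 2>/dev/null | wc -l) + 1 )))

mkdir -p "runs/$NEXT/results"
cp "$CONFIG" "runs/$NEXT/config.yaml"      # freeze the exact configuration
cp job.slurm "runs/$NEXT/job.slurm"        # freeze the exact job script
echo "Created runs/$NEXT"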

Conventions that help: one directory per run, sequential zero-padded run numbers, and a copy of the exact configuration, job script, and environment log inside each run directory, as in the tree above.

12. Keep raw outputs; separate analysis and plotting

To support future reinterpretation, keep the raw outputs of expensive runs, and implement analysis and plotting as separate, rerunnable steps that read from (but never modify) those raw files.

This way, if you change analysis methods or fix a bug, you can rerun the analysis on the same raw data without redoing expensive simulations.

13. Log enough metadata for each run

For every run, capture at minimum the code version, configuration file, job script, resource layout, and start and finish times.

Store this in a simple text file, e.g. metadata.txt:

code_version: 8b32c1f
config_file: config_large.yaml
job_script: job_large.slurm
nodes: 4
tasks_per_node: 32
gpus_per_node: 1
submit_time: 2025-11-02T09:13:45Z
finish_time: 2025-11-02T12:47:21Z
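Writing this file from the job script itself avoids forgetting fields. A sketch, assuming Slurm (the CONFIG variable is assumed to be set earlier in the script):

# Write run metadata at job start; append the finish time at the end.
cat > metadata.txt <<EOF
code_version: $(git rev-parse --short HEAD 2>/dev/null || echo unknown)
config_file: ${CONFIG:-config.yaml}
nodes: ${SLURM_JOB_NUM_NODES:-unknown}
tasks_per_node: ${SLURM_NTASKS_PER_NODE:-unknown}
start_time: $(date -u +%Y-%m-%dT%H:%M:%SZ)
EOF

# ... run the simulation ...

echo "finish_time: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> metadata.txt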

Version Control Practices for Workflows

14. Use version control for more than just source code

Put under version control (e.g. git): source code, build scripts, job scripts, configuration files, analysis scripts, and documentation.

Do not put large binary outputs, raw datasets, or other bulky generated artifacts into the repository; archive or reference them separately.

15. Tie runs to code versions

Avoid the situation “I don’t know which version of the code produced this figure.”

Practices: record the Git commit hash with every run (as in the metadata example above), commit before launching important runs, and avoid running from a working tree with uncommitted changes.

Optional automation: Have your build system embed the Git hash into the executable at compile time.
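One common way to do this, shown here for a C/C++ code as one example (the macro name GIT_HASH is a placeholder), is to pass the hash in as a preprocessor definition at build time:

# In the build script or Makefile rule:
GIT_HASH=$(git rev-parse --short HEAD)
gcc -DGIT_HASH="\"$GIT_HASH\"" -o my_app main.c
# In the code, print it at startup, e.g.:
#   printf("Built from commit %s\n", GIT_HASH);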

16. Make changes traceable

When you change the code, the configurations, or the analysis procedure, record this in commit messages and/or a CHANGELOG.md or experiments.md. This gives future you (or collaborators) a concise narrative of what changed and why.

Scheduling and Job-Script Practices That Aid Reproducibility

(Details of job schedulers are covered elsewhere. Here we focus on conventions.)

17. Keep job scripts under version control and parameterized

Avoid manually editing a single job script for each run. Instead, keep one parameterized job script under version control and pass the run-specific values in at submission time, as sketched below.
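For example (a sketch, assuming Slurm; PROBLEM_SIZE, METHOD, and the solver invocation are placeholders):

# Submit the same version-controlled script with different parameters:
sbatch --export=ALL,PROBLEM_SIZE=1024,METHOD=GMRES job.slurm

# Inside job.slurm, read the exported variables instead of hard-coding them:
echo "Running size=$PROBLEM_SIZE method=$METHOD"
# srun ./solver --size "$PROBLEM_SIZE" --method "$METHOD"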

18. Log scheduler and resource information

In your job script, record the scheduler job ID, the node list, and the resources actually allocated.

Append this to your metadata/log files:

echo "SLURM_JOB_ID: $SLURM_JOB_ID" >> metadata.txt
echo "SLURM_NODELIST: $SLURM_NODELIST" >> metadata.txt

This helps diagnose performance differences and understand where the job actually ran.

19. Use deterministic resource specifications

Avoid relying on “whatever resources the scheduler gives me.” Instead, request an explicit number of nodes, tasks per node, and (if relevant) GPUs, so that the resource layout is part of the recorded configuration.

If you later change resource layout (e.g. from 16 to 32 tasks per node), treat it as a new experiment configuration and record it.
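Explicit resource requests in the job script might look like this (a sketch, assuming Slurm; the numbers are illustrative):

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --time=04:00:00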

Reproducible Post-Processing and Analysis

20. Make analysis steps fully scriptable

For post-processing (e.g. data reduction, statistics, plotting), put every step into scripts rather than performing it interactively.

Common patterns are a single analysis script per experiment, or a small driver script that regenerates all derived data and figures from the raw outputs.

Your final figures should be regenerable from the raw outputs using documented commands.
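For example (a sketch; analyze_results.py and the run layout follow the earlier examples, and the --figures flag is a placeholder for whatever your analysis script accepts):

#!/bin/bash
# make_figures.sh - regenerate figures for every run from its raw outputs (sketch)
set -euo pipefail
for run in runs/run_*; do
  ./analyze_results.py "$run/results/" --figures "$run/figures/"
done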

21. Separate environment for analysis when needed

Sometimes the analysis requires a different software environment than the simulation, for example a newer Python or R stack for statistics and plotting.

If so, capture that analysis environment separately, with its own pinned versions, and record it alongside the analysis scripts.

22. Use notebooks carefully

If you use Jupyter/R notebooks, treat them like code: keep them under version control, avoid relying on hidden execution state, and make sure they can be run non-interactively from top to bottom.
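One way to check that a notebook runs cleanly end to end (assuming Jupyter; the notebook names are placeholders) is to execute it non-interactively:

jupyter nbconvert --to notebook --execute analysis.ipynb --output analysis_executed.ipynb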

Documentation and Communication

23. Maintain a “how to reproduce” document

Include a short, human-readable description of your workflow, e.g. a REPRODUCE.md that lists the required environment, the commands to run in order, and where the results end up.
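A minimal skeleton might look like this (the contents are illustrative only):

REPRODUCE.md (illustrative outline)
1. Environment: which modules, container image, or package versions to use
2. Build: how to compile the code
3. Run: which configurations and job scripts to submit, and in what order
4. Analyze: how to regenerate figures from the raw outputs
5. Expected results: what should match, and within what tolerance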

Test this occasionally (or ask a collaborator) to confirm it still works.

24. Write down assumptions and non-obvious details

Document assumptions and non-obvious details such as required hardware, filesystem paths, expected runtimes, and known sources of numerical variability.

This contextual information helps others interpret your results and their reproducibility limits.

25. Be explicit about what is and is not reproducible

Some aspects may be exactly reproducible (for example, rerunning the analysis pipeline on archived raw outputs), while others are only reproducible up to a tolerance (for example, parallel floating-point reductions or hardware-dependent results).

State clearly which results fall into which category and what level of agreement a reader should expect.

Lightweight Automation and Workflow Tools

26. Start simple; automate gradually

You do not need a complex workflow manager to be reproducible. A good baseline is a handful of version-controlled shell and analysis scripts, explicit configuration files, and consistent run directories, as described above.

Once your workflow becomes more complex (many parameters, dependencies, or stages), consider lightweight workflow tools, for example make-style dependency tools or a dedicated workflow manager.

The goal is to replace informal, manual procedures with documented, executable steps.

27. Validate reproducibility periodically

Choose a small test case and occasionally rerun it from scratch, following only your documented procedure, and compare the results against the archived reference output.

If discrepancies appear, investigate and update documentation accordingly.
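A minimal check could look like this (a sketch; the test configuration, paths, and reference file are placeholders):

# Rerun a small reference case and compare against the archived output.
./run_experiment.sh problem_size=64 method=GMRES
diff runs/test_case/results/summary.txt reference/summary.txt \
    && echo "reproducibility check passed" \
    || echo "outputs differ from the reference"
# For results with expected numerical variability, compare within a
# documented tolerance instead of requiring an exact diff.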

Collaboration and Sharing

28. Use shared repositories for collaborative projects

For multi-person projects, keep code, job scripts, configurations, analysis scripts, and documentation in a shared repository that every collaborator works from.

This ensures all collaborators can reproduce and extend each other’s work.

29. Package artifacts for publication and archiving

When you publish or hand off work, collect the code version, configurations, job scripts, key outputs, environment information, and the reproduction instructions into a single, well-organized archive.

Aim for a package that a competent user familiar with HPC and the target system can use to reproduce your main results with reasonable effort.
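For example (a sketch; the file and directory names are placeholders following the layout above):

tar czf paper_artifacts.tar.gz \
    code/ configs/ \
    runs/run_001/ \
    REPRODUCE.md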

Summary Checklist

A minimal reproducible HPC workflow should: script every step, drive runs from explicit configuration files, pin and record the software environment, log metadata for every run, keep everything except bulky data under version control, and document how to reproduce the main results.

Starting with these practices, you can gradually refine and automate your workflows as your projects and systems grow more complex.
