
Best practices for reproducible workflows

Why Reproducible Workflows Matter in HPC

In high performance computing, a result that cannot be reproduced is usually as bad as a result that never existed. Reproducible workflows let you or someone else run the same analysis later and obtain the same outputs, or at least understand and explain any differences. This is essential when you upgrade compilers, move to a new cluster, change libraries, or revisit a project years later.

On shared systems, environment changes are frequent. Modules are added and removed, default compilers change, filesystem layouts evolve, and hardware is upgraded. A workflow that is not carefully described and captured may silently change behaviour. Best practices for reproducible workflows are about controlling and recording all factors that influence your results, from source code and parameters to environment modules and job scripts.

Reproducibility in HPC is not only about code. It is about the combination of code, input data, parameters, software environment, hardware assumptions, and workflow steps.

Defining Levels of Reproducibility

It is helpful to distinguish between strict bitwise reproducibility and scientifically acceptable reproducibility.

Bitwise reproducibility means you can run the same job again and obtain identical bits in every output file. This is often hard in parallel applications, especially with floating point arithmetic and non deterministic reductions. It may also break across hardware or compiler changes.

Scientifically acceptable reproducibility means that repeated runs produce results that are equivalent for the purpose of the science or engineering question. For numerical simulations this often means that differences are smaller than a specified tolerance or do not change the conclusions of an analysis.

For workflows in HPC, you should decide early which level you aim for. For regulatory, safety critical, or audited environments, bitwise or very tight reproducibility may be required. For exploratory research, it is usually sufficient to document everything so that another person can regenerate similar results and understand any numeric differences.

Capturing the Computational Environment

A central best practice is to make the computational environment explicit and reconstructable. On HPC systems, that usually involves environment modules, site specific software stacks, and sometimes containers. The details of these tools are covered elsewhere. Here the focus is on what you should capture.

You should always record at least the following items for each major run: the module names and versions you load, the compiler and its version, key library versions (especially linear algebra, MPI, and domain specific libraries), the operating system version if possible, and the job scheduler you used and its version if relevant.

The simplest way to capture this information is to print it at the start of every job. You can include commands like module list, env, and compiler version queries in your job script. Redirect their output to a log file next to your scientific results. This ties the environment description to a specific computation.
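
As an illustration, a small helper along the following lines could write such a log. This is only a sketch: it assumes a Linux cluster where the module command is available in a login shell, and it uses gcc as a stand in for your actual compiler, so adapt the commands to your own software stack.

    # environment_snapshot.py - minimal sketch of an environment capture step.
    # Assumes a Linux cluster where "module" works in a login shell and gcc is
    # the compiler of interest; adapt the commands to your own stack.
    import datetime
    import pathlib
    import subprocess

    def run(cmd):
        """Run a command in a login shell and return stdout and stderr as text."""
        result = subprocess.run(["bash", "-lc", cmd], capture_output=True, text=True)
        return result.stdout + result.stderr

    def write_environment_log(log_dir="."):
        log_path = pathlib.Path(log_dir) / "environment.log"
        with open(log_path, "w") as log:
            log.write(f"# captured {datetime.datetime.now().isoformat()}\n\n")
            log.write("## loaded modules\n" + run("module list") + "\n")
            log.write("## compiler version\n" + run("gcc --version") + "\n")
            log.write("## environment variables\n" + run("env | sort") + "\n")

    if __name__ == "__main__":
        write_environment_log()

Calling a script like this at the top of every job script keeps the environment record next to the scientific output it belongs to.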

For more structured approaches, some environments support tools that snapshot the module state or generate environment manifests. If containers are used, store the container image or at least its build recipe along with the project. In both cases, include strong identifiers such as module versions or container image digests, for example SHA256 hashes, not only human readable names like "latest", which can change over time.

Never rely on default modules or implicit paths for reproducible work. Always load and record explicit versions of compilers, libraries, and tools in your workflow.

Making Workflows Scripted Instead of Manual

Manual, click driven, or ad hoc command sequences are very hard to reproduce. In HPC, best practice is to make every step that affects the final output part of an explicit, scripted workflow.

You should avoid running complex analysis directly on the command line without capturing the commands. Instead, create scripts for building code, preparing input data, launching jobs, and post processing outputs. Store these scripts under version control alongside your source code. This includes job submission scripts, environment setup scripts, and analysis scripts.

A simple pattern is to have a top level script that orchestrates the workflow, for example one that sets up the environment, compiles the program, submits one or more batch jobs, and triggers post processing when results are available. Although job scheduling is asynchronous, you can still express dependencies explicitly with job scheduler features or with separate scripts that you run after job completion.
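
A sketch of such a driver is shown below. It assumes a Slurm scheduler (sbatch), a Makefile based build, and hypothetical script and job file names, so treat it as a pattern rather than a ready made implementation.

    # run_workflow.py - hypothetical top level driver for an HPC workflow.
    # Assumes Slurm (sbatch) and a Makefile based build; all file names are
    # placeholders for your own project layout.
    import subprocess

    def sh(cmd, capture=False):
        """Echo a shell command, run it, and abort the workflow if it fails."""
        print(f"+ {cmd}")
        result = subprocess.run(cmd, shell=True, check=True,
                                capture_output=capture, text=True)
        return result.stdout if capture else None

    def main():
        # Build inside a login shell so that module loads from the setup script
        # are visible to the compiler in the same shell.
        sh("bash -lc 'source scripts/setup_env.sh && make -C src clean all'")
        # Submit the main simulation and capture its job ID from --parsable output.
        job_id = sh("sbatch --parsable jobs/simulation.sbatch",
                    capture=True).strip().split(";")[0]
        # Post processing starts only after the simulation finishes successfully.
        sh(f"sbatch --dependency=afterok:{job_id} jobs/postprocess.sbatch")

    if __name__ == "__main__":
        main()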

Whenever you find yourself typing a non trivial sequence of commands more than once, turn it into a script. This applies to data preparation and visualization as well. The goal is that every figure, table, or data product can be regenerated from scripts and raw data, without remembering manual steps.

Version Control for Code and Configuration

Reproducible workflows in HPC depend on robust version control. It is not enough to keep a copy of the code. You must know exactly which version of the code produced which result.

Best practice is to keep all source code in a version control system such as Git. This includes the scientific code, auxiliary scripts, job submission scripts, and configuration files that hold parameter sets. For each important run, capture and store the commit hash. You can do this programmatically by printing the current commit hash into the job log at run time or by embedding it in the executable during build using compiler defines.
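
For example, a run script could query Git directly and write the result into the job log. This sketch assumes the job is launched from inside the repository's working tree; the dirty marker is one simple convention for flagging uncommitted changes.

    # code_version.py - record which commit produced a run.
    # Assumes the script is executed from inside the Git working tree.
    import subprocess

    def git(*args):
        return subprocess.run(["git", *args], capture_output=True,
                              text=True, check=True).stdout.strip()

    def current_code_version():
        """Return the HEAD commit hash, marked as dirty if there are local edits."""
        commit = git("rev-parse", "HEAD")
        dirty = git("status", "--porcelain")
        return commit + ("-dirty" if dirty else "")

    if __name__ == "__main__":
        print(f"code version: {current_code_version()}")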

Avoid making untracked changes on the cluster. If you are forced to patch code interactively, commit those changes and push them back to your main repository. Uncommitted modifications are a major threat to reproducibility, since they are easy to lose and hard to identify later.

Tagging repository states for specific results is helpful. For example, create tags like paperX-figure3 or simulationY-final. Combine these tags with job identifiers from the scheduler so that each dataset points to a specific code state and each code state points to a set of jobs.

Important rule: Every published or reported result must be traceable to a specific commit or tag in version control and a specific set of input parameters.

Managing Inputs, Parameters, and Randomness

Even with fully controlled code and environment, differences in inputs or parameters will change results. Reproducible workflows treat inputs, configuration, and random number usage as first class citizens.

Keep raw input data read only and store it in locations with clear, stable paths. If raw data is large, store checksums with it, for example SHA256 digests, and document the source of the data and any preprocessing steps. These preprocessing steps should also be scripted.
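
A minimal way to generate and store such digests is sketched below. The raw_data directory, the .dat file pattern, and the SHA256SUMS.txt file name are assumptions, chosen to match the two column format used by common checksum tools.

    # checksum_inputs.py - write SHA256 digests for raw input files.
    # The "raw_data" directory, "*.dat" pattern, and file name are assumptions.
    import hashlib
    import pathlib

    def sha256_of(path, chunk_size=1 << 20):
        """Compute the SHA256 digest of a file without loading it all into memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    if __name__ == "__main__":
        data_dir = pathlib.Path("raw_data")
        with open(data_dir / "SHA256SUMS.txt", "w") as out:
            for path in sorted(data_dir.glob("*.dat")):
                out.write(f"{sha256_of(path)}  {path.name}\n")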

Configuration parameters should never be embedded in source code with ad hoc changes. Instead, keep them in text files such as JSON, YAML, or simple key value formats that your application reads at run time. Store these files under version control. For each run, keep a copy of the configuration file in the output directory. You can use naming conventions that include timestamps or job IDs.
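
One way to follow this pattern is to load the configuration at startup and immediately copy it into the run's output directory, as in this sketch. The JSON format and the file and directory names are assumptions; any text based configuration format works the same way.

    # load_config.py - read a JSON parameter file and archive it with the outputs.
    # File and directory names are placeholders for your own layout.
    import json
    import pathlib
    import shutil

    def load_and_archive_config(config_path, output_dir):
        """Read run parameters and keep a verbatim copy next to the results."""
        output_dir = pathlib.Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        with open(config_path) as f:
            params = json.load(f)
        shutil.copy2(config_path, output_dir / "config_used.json")
        return params

    if __name__ == "__main__":
        params = load_and_archive_config("config/run.json", "results/run_001")
        print(params)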

When simulations or algorithms use randomness, always set and record random number seeds. This applies to your own code and to libraries when they allow you to set seeds. You may, for example, pass a seed from the job script into the application as a command line parameter. For runs with many tasks or MPI ranks, define a systematic scheme for seeds, such as a base seed plus the rank identifier, so that behaviour is controlled but reproducible.
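
A simple seed scheme of this kind, using a base seed plus the MPI rank, might look like the following sketch. It assumes mpi4py and NumPy, which your own code may or may not use; the important point is the deterministic mapping from base seed and rank to the per rank seed, and the fact that the seed is logged.

    # seeded_rng.py - derive a per-rank random seed from a base seed.
    # Assumes mpi4py and NumPy; adapt to your own random number machinery.
    import sys
    from mpi4py import MPI
    import numpy as np

    def rank_rng(base_seed):
        """Return a NumPy generator seeded reproducibly for this MPI rank."""
        rank = MPI.COMM_WORLD.Get_rank()
        seed = base_seed + rank
        print(f"rank {rank}: using random seed {seed}")
        return np.random.default_rng(seed)

    if __name__ == "__main__":
        # The base seed arrives from the job script as a command line argument.
        base_seed = int(sys.argv[1]) if len(sys.argv) > 1 else 12345
        rng = rank_rng(base_seed)
        print(rng.random(3))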

It is also important to record derived parameters that the application computes automatically, for example grid resolutions derived from memory limits or run lengths derived from input sizes. You can have the application print all important parameter values at startup and direct that output to log files.

Structuring Outputs and Metadata

Reproducible workflows benefit from a clear and consistent layout of result directories and metadata files. The goal is that a future reader can understand what a directory contains and how it was created, without guessing.

A common pattern is to create one directory per major run or experiment, named with a timestamp and a short description. Inside this directory, store input configurations, logs, raw outputs, and post processed results. Preserve the job script used to launch the run, the module list output, and the code version information. If you use containers, store the container manifest or reference.

You can also maintain a small metadata file in a simple format such as JSON or YAML that records key information about the run. Examples include code version hash, date, user, machine or cluster name, job ID, input files, main parameters, and a short description of the purpose. This can be written automatically by your workflow scripts when they create the run directory.
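
The sketch below writes such a metadata file when a run directory is created. All field names, and the use of Slurm's SLURM_JOB_ID variable, are assumptions to adapt to your own scheduler and project conventions.

    # write_metadata.py - record key facts about a run in a small JSON file.
    # Field names and the SLURM_JOB_ID variable are assumptions; adapt as needed.
    import datetime
    import getpass
    import json
    import os
    import pathlib
    import socket

    def write_run_metadata(run_dir, code_version, config_file, description):
        metadata = {
            "date": datetime.datetime.now().isoformat(),
            "user": getpass.getuser(),
            "host": socket.gethostname(),
            "job_id": os.environ.get("SLURM_JOB_ID", "unknown"),
            "code_version": code_version,
            "config_file": str(config_file),
            "description": description,
        }
        run_dir = pathlib.Path(run_dir)
        run_dir.mkdir(parents=True, exist_ok=True)
        with open(run_dir / "metadata.json", "w") as f:
            json.dump(metadata, f, indent=2)

    if __name__ == "__main__":
        write_run_metadata("results/run_001", "abc1234", "config/run.json",
                           "baseline resolution study")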

Avoid mixing results from different code versions or parameter sets in the same directory. If post processing or plotting scripts read from multiple runs, make sure they reference explicit paths instead of generic names like "latest". This reduces confusion when you rerun experiments with modified settings.

Important statement: Never overwrite original outputs without keeping a versioned backup. Use new directories or filenames when repeating or modifying runs.

Recording Hardware and Resource Assumptions

In HPC, performance and, sometimes, numerical results can depend on hardware characteristics and allocated resources. For example, using different numbers of MPI ranks per node may change floating point summation order. Using GPUs instead of CPUs may change precision. Best practices for reproducible workflows include recording these details.

For each run, record the number of nodes, cores per task, MPI ranks, and threads per rank. Many job schedulers provide environment variables that specify these values. You can print them at runtime. Also record which partition or queue you used, since partitions may map to different hardware.
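
With Slurm, for example, a few environment variables carry most of this information; the sketch below simply prints them into the job log. The variable names are Slurm specific and other schedulers use different ones.

    # log_resources.py - print the allocated resources reported by Slurm.
    # Variable names are Slurm specific; adapt for other schedulers.
    import os

    RESOURCE_VARS = [
        "SLURM_JOB_ID",
        "SLURM_JOB_PARTITION",
        "SLURM_JOB_NUM_NODES",
        "SLURM_NTASKS",
        "SLURM_CPUS_PER_TASK",
        "OMP_NUM_THREADS",
    ]

    if __name__ == "__main__":
        for var in RESOURCE_VARS:
            print(f"{var}={os.environ.get(var, 'unset')}")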

If your code depends on specific CPU instruction sets, such as AVX2 or AVX512, or specific GPU models, mention this in your documentation and run metadata. Different hardware generations may cause small numeric differences, especially for reduced precision arithmetic.

Although you cannot always guarantee the same physical node will be used later, documenting hardware assumptions allows you or others to interpret differences correctly and to choose equivalent resources when re running workflows on new systems.

Handling Floating Point and Numerical Reproducibility

Many HPC applications use floating point arithmetic at large scale, often with parallel reductions. Results may show small run to run differences even on the same machine, especially when thread or process scheduling changes. Best practices for reproducible workflows aim to control and quantify these effects.

One practice is to use numerically stable algorithms, such as compensated summation for large reductions, when bitwise reproducibility is important. Some libraries provide reproducible reduction operations that enforce a fixed order of operations, although these can be slower.
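
As a concrete illustration of compensated summation, the classic Kahan algorithm keeps a running error term alongside the partial sum. This sketch shows the scalar version in plain Python; production codes would use a compiled or library implementation.

    # kahan_sum.py - compensated (Kahan) summation, a numerically stable
    # alternative to naive accumulation for long reductions.
    def kahan_sum(values):
        total = 0.0
        compensation = 0.0  # running estimate of the lost low-order bits
        for x in values:
            y = x - compensation
            t = total + y                   # low-order digits of y may be lost here
            compensation = (t - total) - y  # recover what was lost
            total = t
        return total

    if __name__ == "__main__":
        data = [0.1] * 10
        print(kahan_sum(data))  # compensated result
        print(sum(data))        # naive accumulation, typically slightly off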

You should define acceptable tolerances for numerical comparisons and document them. For instance, you may assert that two results are considered equivalent if relative error is below $10^{-10}$ in aggregate measures. When comparing outputs from different runs, avoid comparing raw binary files. Instead, compare derived quantities, such as norms, integrals, or error metrics, that reflect the scientific meaning of the results.
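
A comparison along these lines can be as simple as the following sketch, where the tolerance of $10^{-10}$ is taken from the example above and the compared quantities are assumed to be scalar summary values such as norms.

    # compare_results.py - tolerance based comparison of derived quantities.
    # The tolerance and the choice of summary values are project decisions.
    def relatively_equal(reference, candidate, rel_tol=1e-10):
        """True if the relative difference is within the agreed tolerance."""
        denom = max(abs(reference), abs(candidate), 1e-300)  # avoid division by zero
        return abs(reference - candidate) / denom <= rel_tol

    if __name__ == "__main__":
        reference_norm = 42.000000000000   # e.g. a norm from the reference run
        new_norm = 42.000000000001         # same quantity from the repeated run
        print(relatively_equal(reference_norm, new_norm))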

Another important practice is to disable aggressive compiler optimizations that break strict IEEE floating point semantics if bitwise or very tight reproducibility is required. For example, avoid flags that allow reassociation of floating point operations if that conflicts with your reproducibility goals. At the same time, recognise that some performance optimizations depend on such relaxations, so this is a balance between performance and reproducibility.

Automated Testing and Continuous Verification

Reproducible workflows do not end after the first successful run. Over time, code evolves, environments change, and clusters are upgraded. To maintain reproducibility, you should include automated tests and periodic verification runs in your workflow.

Unit tests and regression tests can verify that small parts of the code behave identically after changes. For full workflows, you can maintain smaller test cases that run quickly and compare their outputs against reference results. When results involve floating point computation, tests should use tolerance based comparisons rather than exact matching.
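
In a Python based workflow such a regression test might be written with pytest, comparing a freshly computed summary value against a stored reference within a tolerance. The run_small_case function and the reference value here are placeholders for your own miniature test problem and its previously validated result.

    # test_regression.py - a pytest style regression test with a tolerance.
    # run_small_case() and REFERENCE_ENERGY are placeholders for your own
    # test problem and its previously validated result.
    import math

    REFERENCE_ENERGY = -1.2345678901   # stored result of the small test case

    def run_small_case():
        """Stand-in for running the real miniature test problem."""
        return -1.2345678902

    def test_small_case_matches_reference():
        energy = run_small_case()
        assert math.isclose(energy, REFERENCE_ENERGY, rel_tol=1e-9, abs_tol=0.0)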

Where possible, integrate these checks into continuous integration systems. Even if you cannot run full scale HPC jobs in such environments, you can still build the code with representative compilers, run small problem sizes, and confirm that results and performance remain within expected limits.

On the cluster itself, you can maintain a set of standard jobs that you rerun after major system updates or module changes. Comparing their outputs and run times provides early signals that something has changed in the environment that may impact reproducibility.

Key practice: Treat reproducibility as a property you must continuously test, not a one time achievement.

Documentation and Workflow Narratives

Scripts and metadata capture the mechanical parts of a workflow. Human readable documentation ties everything together and makes reproducibility achievable for others.

You should maintain a short narrative description of your workflow. This can be in a README file at the root of your project or in a dedicated document. It should explain the sequence of steps required to reproduce the main results, such as preparing inputs, building the code, submitting key jobs, and running post processing. Reference specific scripts, configuration files, and tags.

Where appropriate, document known pitfalls. For example, mention if a particular module version is required because a newer version changes output slightly, or if certain environment variables must be set. Explain which parts of the workflow are expected to be bitwise reproducible and which are only scientifically reproducible within tolerances.

For complex projects, consider including literate descriptions, for instance notebooks or text files that mix explanatory text and commands. Even if they cannot run directly on the HPC system, they serve as a guide for reproducing the workflow with the available scripts.

Organizing Projects for Longevity

A reproducible workflow is easier to preserve when the project is well structured. Over time, HPC projects accumulate scripts, datasets, and results. Without discipline, this can become unmanageable.

Use a consistent directory layout across projects. Separate source code, input data, configuration files, job scripts, and results into clear subdirectories. Avoid placing large generated files in the same location as source code under version control. Instead, keep version controlled metadata and configuration that describe how to regenerate the large files.

For very large datasets and outputs, coordinate with your site data management policies. Use parallel file systems appropriately during computation, but archive final results in more stable storage, such as project spaces or archival systems, and keep checksums and documentation there.

Finally, consider the perspective of a future collaborator or of your future self in several years. If they clone your repository and have access to the same or a similar cluster, they should be able to follow your documented workflow and regenerate key results, or at least understand why exact reproduction is not possible and what the differences imply.

Integrating Best Practices into Daily Work

The most successful reproducible workflows are not extraordinary efforts that you add at the end. They come from habits built into daily work. Start new projects with version control from the first day. Write job scripts and environment setup scripts before the first large run. Record parameters and seeds routinely. Organize outputs from the beginning.

At first, these practices may feel like extra work. Over time, they save time by preventing confusion, lost results, and long debugging sessions spent chasing mysterious differences. They also make it easier to collaborate, to migrate to new systems, and to build on previous work.

In HPC environments, where systems evolve quickly and computations are expensive, adopting these best practices for reproducible workflows is not optional. It is a core skill that supports reliable science, engineering, and industry applications.
