Why Documentation Matters in HPC Projects
Documentation is part of the scientific and engineering result, not an optional extra. On HPC systems, undocumented workflows quickly become unusable, irreproducible, and impossible to hand over to collaborators. Good documentation turns a one‑off experiment into a reusable, auditable workflow that you or someone else can rerun months or years later.
In your final project, the quality of your documentation is as important as the correctness and performance of your code. It should allow a competent HPC user who has never seen your project before to understand what the project does, set up the environment, run it, and interpret key outputs.
Core rule: Someone with basic HPC skills should be able to reproduce your main results using only your repository, your documentation, and access to a similar cluster.
This chapter gives you a checklist and concrete patterns to achieve that goal.
Layers of Documentation in an HPC Project
An HPC project typically has several layers of documentation. Each has a different audience and purpose.
First, there is project‑level documentation. This explains what the project is about, how to obtain and build the code, how to run it on a cluster, and what results to expect.
Second, there is workflow documentation. This covers how to run the full workflow on an HPC system, including preprocessing, main runs, checkpointing, postprocessing, and analysis.
Third, there is code‑level documentation. This is aimed at developers and includes comments, function descriptions, and design notes.
Finally, there is environment and experiment documentation. This records which modules, containers, compilers, input data, and job parameters were used to obtain specific results.
Your final project should address all of these layers, even if briefly.
Minimum Documentation Package for the Final Project
For the purposes of this course, assume that your project repository must contain at least the following documentation artifacts.
You must have a README at the top level of your project. This is the entry point for any reader. It should summarize the problem you address, describe how to set up the environment, explain how to build the code, and provide example commands to run a minimal case and a representative performance case.
You must include one or more job scripts with clear comments. These scripts should show how you intend your code to be run on a cluster scheduler, including resources, wall time, and key environment variables.
You should provide configuration or parameter files with explanations of the main options. These files separate code from run‑time choices and make experiments repeatable and comparable.
You should include a brief performance and scaling note. It does not need to be long, but it should record at least one performance measurement and the conditions under which it was obtained.
Finally, you should have a short file that explains provenance and reproducibility. It should record the software stack, either in the form of module lists, container recipes, or conda environment files, and it should reference any external data sources used.
Structuring the Project README
The README is the document most people will read first. It should be concise and structured in predictable sections. Long background explanations belong elsewhere, for example in a report or paper.
A practical structure for your final project is as follows.
Start with a short overview. In one or two paragraphs, state the main problem, the computational approach, and the role of HPC in the project. Name the main programming languages and parallel programming models you used.
Then describe prerequisites and environment. Indicate which compilers, MPI implementations, GPU toolkits, or containers are required. If you relied on environment modules, list their names and example versions. If you used a container, point to the image file or the recipe used to build it.
Next, describe build instructions. Show exact commands to compile or configure the code on a typical cluster, preferably in a separate build section. If you use Make or CMake, mention the relevant targets and configuration options.
Then give run instructions. Provide at least one complete example command sequence that a user can copy and adjust. This should include any job submission commands, such as sbatch with a reference to the correct job script. If the workflow needs multiple steps, such as preprocessing, main run, and postprocessing, outline the sequence and refer to more detailed documentation in separate files when needed.
After that, document input and output. Explain what input data is required, where to get it, and how large it is. For outputs, list the main output files, their format, and how to interpret the most important numbers or plots.
Finally, add a short section on limitations and known issues. Document what you know does not work, such as unsupported compilers, very small or very large problem sizes, or features that are partially implemented.
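The structure described above can be condensed into a skeleton. One possible outline is sketched below; the section names and the example submission command are suggestions, not requirements:

```text
README outline for the project

Overview
  One or two paragraphs: problem, computational approach, role of HPC,
  main languages and parallel programming models.

Prerequisites and environment
  Required compilers, MPI implementation, GPU toolkit; module names and
  example versions, or a pointer to the container image or recipe.

Build instructions
  Exact commands to configure and compile on a typical cluster.

Run instructions
  At least one complete, copyable command sequence, including job
  submission (for example: sbatch jobs/<script>.sh).

Input and output
  Required input data, where to get it, and how large it is; main
  output files and how to interpret the key numbers or plots.

Limitations and known issues
  What is known not to work.
```

Keeping the sections in a predictable order like this lets a reader jump directly to the part they need.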
Documenting HPC Workflows and Job Scripts
For an HPC project, the workflow on the cluster is just as important as the code. Many failures in reproducibility occur because the exact way runs were launched was not recorded.
Your job scripts and accompanying text should explain both the resource requests and the execution logic.
Inside job scripts, prefer descriptive comments to opaque flags. For example, explain why you request a certain number of nodes, tasks, and threads, or why you bind threads to specific cores. If your script sets environment variables that affect performance, such as OMP_NUM_THREADS or GPU visibility, document the effect of these settings.
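As a sketch of this style, the annotated Slurm batch script below documents each resource request and environment setting in place. The executable name, module versions, and resource numbers are placeholders, and account or partition directives are omitted:

```bash
#!/usr/bin/env bash
#SBATCH --job-name=diffusion-strong   # appears in the queue and in log names
#SBATCH --nodes=4                     # number of nodes for this scaling point
#SBATCH --ntasks-per-node=8           # 8 MPI ranks per node
#SBATCH --cpus-per-task=4             # 4 OpenMP threads per rank -> 32 cores/node
#SBATCH --time=01:00:00               # margin above the typical runtime
#SBATCH --output=logs/%x-%j.out       # one log per job: job name (%x) and id (%j)

# Thread count must match --cpus-per-task, otherwise ranks oversubscribe cores.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=close            # keep threads close to their parent rank

module purge
module load gcc/12.2 openmpi/4.1      # versions also recorded in environment-modules.txt

srun ./diffusion configs/strong-scaling-n4.cfg
```

The point is that every nontrivial choice carries its justification next to it, so a new user can adapt the script without guessing.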
If the workflow uses multiple sequential jobs, for example a pipeline or a chain of checkpoints, describe the ordering explicitly. You can either chain jobs through scheduler dependencies or document the manual sequence in a workflow section of your documentation.
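If you use scheduler dependencies, the submission sequence itself is worth recording in a script or in the documentation. A sketch with Slurm, where the three stage scripts are placeholders:

```bash
# Submit pipeline stages so each starts only after the previous stage
# completed successfully (afterok). --parsable makes sbatch print just
# the job id, which is captured for the next dependency.
pre=$(sbatch --parsable preprocess.sh)
main=$(sbatch --parsable --dependency=afterok:${pre} main-run.sh)
sbatch --dependency=afterok:${main} postprocess.sh
```

Recording the sequence this way makes the ordering explicit and reproducible, rather than something a new user must infer.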
If you use relative paths or scratch filesystems, mention where temporary data is stored and whether it is safe to delete it. On shared filesystems, explain any assumptions about directory structure that a new user must satisfy, such as input, output, or logs directories.
The goal is that a reader can take your job scripts, make minimal local adjustments, and reproduce your runs.
Recording Software Environments and Dependencies
On HPC systems, the software environment can change over time. Modules are updated, system libraries are replaced, and user environments evolve. Without a record of your environment, a working project can silently break.
In your final project, always record the software stack in both machine-readable and human-readable form.
On clusters that use environment modules, include the output of module list in a file, for instance environment-modules.txt. If the list is long, you can edit it to remove irrelevant modules, but keep all that affect compilation and runtime.
If you use containers, include your container recipe, such as a Singularity or Apptainer definition file, or at minimum a reference to the image version and where it can be obtained.
If you use a language-specific package manager, such as pip or conda, export your environment to a file. For conda this can be environment.yml, for pip it can be requirements.txt, and for R it can be a lock file (for example renv.lock) or a list of installed packages.
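These exports can be gathered with one small script. The sketch below records whichever tools are actually present on the system, using the file names mentioned above:

```shell
#!/usr/bin/env bash
# Snapshot the software environment into version-controlled files.
# Each tool is optional: only the ones available on this system run.
set -u

if command -v module >/dev/null 2>&1; then
    # 'module list' typically prints to stderr, hence the 2> redirection
    module list 2> environment-modules.txt
fi
if command -v conda >/dev/null 2>&1; then
    conda env export > environment.yml
fi
if command -v pip >/dev/null 2>&1; then
    pip freeze > requirements.txt
fi
```

Running such a script right after a key experiment ties the snapshot to that run.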
Additionally, in your documentation, summarize the most critical version constraints. For example, indicate the MPI standard level needed, the minimum CUDA version if using GPUs, and any special compiler features required.
This information allows others to reconstruct a compatible environment, and also helps you diagnose differences in performance and behavior between runs.
Documenting Experiments, Parameters, and Results
In an HPC project, one usually runs many experiments, sometimes with slight variations in parameters and resources. Without a clear record, it becomes difficult to link a plot or table back to the exact run that produced it.
For your final project, maintain at least a simple experiment log. This can be a text file, a spreadsheet, or a structured document in your repository.
At minimum, for each key run that appears in your report, record the date; the problem size or relevant input parameters; the resources requested (nodes, tasks, threads, GPUs); the job script used, or a reference to it; and the environment snapshot, such as a module list or container version. Also record the main performance metrics or scientific results you extracted, for example runtimes, speedups, scaling efficiency, or error norms.
A compact way to structure such a log is to give each experiment a short identifier and refer to that identifier in figures and text. You can also use the identifiers in job names, file names, and plots.
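Such a log can be as simple as a CSV file with one row per run. The sketch below creates the log and appends one entry; the identifier scheme, column set, and all values are illustrative placeholders:

```shell
#!/usr/bin/env bash
# Append one row per key run to a simple CSV experiment log.
# The identifier (here ss-n4-r1) is reused in job names, output
# file names, and figure captions. All values are illustrative.
log=experiments.csv
if [ ! -f "$log" ]; then
    echo "id,date,nodes,tasks,threads,config,runtime_s,notes" > "$log"
fi
echo "ss-n4-r1,2024-05-10,4,32,4,configs/strong-scaling-n4.cfg,2412,baseline strong-scaling point" >> "$log"
```

Appending a row from the job script itself, at the end of a successful run, keeps the log from drifting out of date.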
If your code accepts configuration files, keep them in a configs directory and do not overwrite them after a run; create a new file for each variant instead. This creates a clear link between configuration files and results.
Best practice: Every figure or table in your final report should be traceable to a specific run, with documented input parameters, resources, and environment.
Source Code Documentation Focused on HPC Aspects
Detailed programming tutorials belong elsewhere in the course, but for the final project you should write code with enough documentation that another HPC‑capable programmer can maintain and extend it.
Focus your comments and code documentation on aspects that are specific to parallel and high performance behavior.
Document the parallel decomposition. For example, explain how data is distributed across MPI ranks or threads, which communicator is used for what purpose, and how boundaries or halos are managed.
Describe any nontrivial synchronization or communication patterns, such as custom reductions, neighborhood exchanges, or nonblocking patterns. Point to the relevant functions or modules that implement these patterns.
If your performance relies on a particular data layout or memory access pattern, explain the rationale where it is implemented. For example, mention why you use structure of arrays instead of array of structures in a key kernel.
Comment on assumptions about problem size, hardware topology, or node architecture. If code assumes a fixed number of ranks along a dimension, or a particular NUMA layout, this should be clearly stated.
Keep code comments focused and update them when refactoring. Outdated comments are worse than none.
Documenting Performance and Scaling Results
The final project typically includes a performance analysis component. Documentation must tie the performance results to the conditions under which they were obtained.
When you report performance, always specify at least three elements. First, describe the problem size or workload. Second, indicate the hardware and resource configuration. Third, state the measurement methodology, such as wall clock times averaged over multiple runs.
If you present strong or weak scaling, explicitly define your scaling scenario in the documentation. For strong scaling, clarify the fixed problem size and the range of resources. For weak scaling, note how the problem size grows with the number of processes or nodes.
When possible, publish raw timing data in simple text or CSV files alongside your repository. These files should include the number of processes, threads, GPUs, and relevant timing metrics such as total runtime, per step time, and communication time if available.
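As an illustration, a raw timing file for a strong-scaling series might look as follows; the column set matches the metrics mentioned above, and all numbers are hypothetical placeholders:

```text
nodes,tasks,threads,total_runtime_s,per_step_ms,comm_time_s
1,8,4,9620,962.0,310
2,16,4,4930,493.0,355
4,32,4,2610,261.0,400
8,64,4,1480,148.0,470
```

Plain CSV like this can be read directly by plotting scripts, so the figures in your report can be regenerated from the published data.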
If the code includes timers or profiling hooks, document how to enable or disable them, and how to interpret their output. Mention any profiling tools that you used, even if they are not required to run the code.
This level of documentation allows others to validate your performance claims and reuse your measurements in future work.
Reproducible Workflows and Automation
Manual steps are fragile. For your final project, aim to reduce undocumented manual intervention wherever possible and to record any steps that must remain manual.
Automating common tasks through scripts not only saves time but also acts as executable documentation. For example, a single script can set up the environment, create directories, and submit a series of jobs. Such scripts should be extensively commented and referenced from your main documentation.
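As a sketch of such a driver, the script below prepares the directory layout and, when a scheduler is available, submits a scaling series; the job script path and node counts are placeholders:

```shell
#!/usr/bin/env bash
# Driver script: prepare run directories and submit a scaling series.
# Doubles as executable documentation of the experiment setup.
set -eu

# Directories the job scripts assume to exist (see repository layout).
mkdir -p logs results configs

for nodes in 1 2 4 8; do
    if command -v sbatch >/dev/null 2>&1; then
        sbatch --nodes="$nodes" jobs/strong-scaling.sh
    else
        # Outside the cluster, just show what would be submitted.
        echo "would submit: sbatch --nodes=$nodes jobs/strong-scaling.sh"
    fi
done
```

Because the script can be rerun, it is also a cheap way to test that your documented setup steps are complete.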
If your workflow requires editing configuration files, consider providing template files and explaining exactly which parameters to change. Where possible, prefer command line arguments to editing files manually, and document example invocations that match the experiments in your report.
If data preprocessing or postprocessing is required, include scripts or notebooks that perform these tasks. Document the input and output paths and formats. State explicitly which preprocessing steps are necessary before the main HPC runs can start, and which postprocessing steps are needed to obtain the plots and tables shown in your report.
By making your workflow executable through scripts and job files, you transform documentation from a static description into something that can be tested and verified.
Organizing the Project Repository
Good organization is part of good documentation. A clear directory structure makes it easier to find code, input data, scripts, and documentation, without reading long descriptions.
A typical structure for a small HPC project includes separate top level directories for source code, job scripts, configuration files, and documentation. Results and large data are often stored outside of the repository or under a separate results directory that is selectively version controlled.
For your final project, choose a simple and consistent structure and describe it briefly in your README. Explain where to find example job scripts, where configuration files live, and where output is expected. If directories must exist before running jobs, such as logs or scratch, either create them via scripts or explain how to create them.
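One simple layout along these lines is sketched below; the directory names are common conventions, not requirements:

```text
project/
├── README        entry point: overview, environment, build, run, limitations
├── src/          source code
├── jobs/         job scripts for the scheduler
├── configs/      configuration files, one per experiment variant
├── scripts/      setup, submission, and postprocessing helpers
├── docs/         workflow notes, performance and provenance files
├── logs/         scheduler output (created by scripts, not versioned)
└── results/      selected outputs (large data stays outside the repository)
```

A short version of such a tree, with one-line descriptions, is often all the layout documentation a README needs.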
Within the documentation, use consistent names and paths to avoid confusion. If you rename files or directories late in the project, verify that all references in the documentation are updated.
Best Practices Checklist for Your Final Submission
To help you review your project before submission, you can use the following checklist as a summary of documentation and workflow best practices.
Check that your project has a clear README at the top level, with description, environment, build, run, input and output, and limitations sections.
Verify that at least one complete job script is included and documented, with comments on resource choices and environment variables.
Confirm that you have a record of the software environment, such as module lists, container recipes, or package requirement files.
Ensure that each key experiment in your report can be traced back to specific configuration files, job scripts, and log or output files.
Make sure that performance results are documented together with problem size, hardware configuration, and measurement method, and that raw timing data is available.
Check that your repository is organized and that the directory layout is described in the documentation.
Finally, test your own documentation by following it from scratch, in as clean an environment as possible. This is the best way to find gaps, implicit assumptions, or missing steps.
If you meet these criteria, your final project will not only demonstrate your understanding of high performance computing concepts, but will also serve as a solid, reusable example of a well documented HPC workflow.