Why documentation matters in HPC
In HPC, clear documentation is not optional:
- You will not remember every build flag or module a year from now.
- Colleagues, support staff, and your future self need to understand and reproduce your work.
- Clusters change: modules are updated, compilers change, queues are reconfigured. Good documentation is the only way to keep your workflow reproducible.
For the final project, treat documentation as part of the deliverable, not an afterthought.
Key goals:
- Make your work re-runnable: another user can run your code and scripts and get the same type of results.
- Make your work understandable: the main design ideas and limitations are clear.
- Make your work maintainable: small changes (new dataset, different node count) are easy.
Core project documentation artifacts
For the final project, you should produce at least these four types of documentation:
- README (top-level overview)
- Run instructions (how to build and execute on the cluster)
- Performance & scaling notes (what you measured and what it means)
- Reproducibility metadata (environment, versions, configurations)
These can be separate files or sections in a single main document, as long as they are clearly organized.
1. README: the entry point
A good README.md (or similar) answers:
- What problem are you solving?
Short description in 2–5 sentences; mention whether it is simulation, data analysis, etc.
- What does the code do?
One paragraph on major features, not every function.
- What is the input and output?
- Input: files, parameters, and typical sizes (e.g. matrix size, grid resolution).
- Output: what is produced (timing logs, plots, data files, etc.).
- What is required to run it?
- Language and main dependencies (e.g. C++17, MPI, OpenMP, CUDA, specific libraries).
- Expected environment (e.g. “tested on clusterX with GCC 12 and OpenMPI 4”).
- Quick start example
Provide a minimal “from zero to run” sequence such as:

```bash
module load gcc/12.2 openmpi/4.1
mkdir build && cd build
cmake ..
make -j
sbatch ../scripts/run_weak_scaling.slurm
```

Keep it concise; details go in more specific files or sections.
2. Run instructions: build and execution
HPC projects live or die on whether they can be rebuilt and rerun.
Build instructions
Specify:
- Modules or environment setup
Document these in an env-setup.sh script or in docs/environment.md:

```bash
module purge
module load gcc/12.2 openmpi/4.1 cmake/3.27
# optional: module list > modules_used.txt
```

- Build system usage
Be explicit about commands and options:

```bash
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=ON ..
make -j8
```

Note any known build variants, e.g.:
- Debug vs release builds
- CUDA vs non-CUDA builds
- MPI-only vs hybrid MPI+OpenMP
You do not need to re-explain compilers or build systems here; just document how your project uses them.
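If your project supports several of these variants, a short block of example configure commands saves readers from digging through the build files. A minimal sketch (USE_OPENMP matches the example above; USE_CUDA is only an illustration, substitute your project's actual option names):

```bash
# Debug build (extra checks, no optimization):
cmake -DCMAKE_BUILD_TYPE=Debug ..

# Release build with hybrid MPI+OpenMP:
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=ON ..

# GPU build, assuming the project defines a USE_CUDA option:
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_CUDA=ON ..
```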
Execution instructions
Clarify:
- How to run on a login node (if allowed for small tests)
For basic debugging or tiny test runs:
```bash
./my_app --nx 128 --ny 128 --steps 10
```

- How to submit batch jobs
Show at least one working example per major experiment type (e.g. strong scaling, weak scaling):
```bash
sbatch scripts/strong_scaling_4nodes.slurm
sbatch scripts/strong_scaling_8nodes.slurm
```

For each provided job script, briefly note:
- What it does (strong vs weak scaling, target runtime, etc.).
- Assumptions (queue/partition, time limit, memory per node, GPU type, etc.).
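If it helps, you can also include a stripped-down job script directly in the docs so readers see these assumptions at a glance. A minimal sketch, assuming a SLURM cluster and the example application from above (the partition name, resource numbers, and program arguments are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=strong_4n
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=00:30:00
#SBATCH --partition=standard        # placeholder: use your cluster's queue

# Strong scaling run: fixed problem size on 4 nodes.
module purge
module load gcc/12.2 openmpi/4.1

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./build/my_app --nx 1024 --ny 1024 --steps 1000
```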
Parameters and configuration
Avoid “magic numbers” in your instructions. Document:
- Important parameters: grid size, iteration counts, solver tolerances, I/O frequency.
- How to change them: command-line options, configuration file, or compile-time #define.
Example:
- Command-line arguments documented in docs/usage.md, with a short synopsis:

```bash
./my_app --nx NX --ny NY --steps STEPS [--output-interval N] [--checkpoints PATH]
```

- Default values and recommended ranges (for the cluster used).
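If you run parameter studies, a tiny driver script in scripts/ also documents how the options are meant to be used. A minimal sketch, reusing the hypothetical my_app options from the synopsis above:

```bash
# Run the same binary at several problem sizes and keep one log per run.
mkdir -p results
for nx in 256 512 1024; do
    ./my_app --nx "${nx}" --ny "${nx}" --steps 100 \
             --output-interval 10 > "results/size_${nx}.log"
done
```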
3. Performance and scaling documentation
Your final project includes performance analysis; here you document the what, how, and summary conclusions, not every raw log line.
What to record
For each experiment (e.g. node counts, problem sizes, GPU vs CPU):
- Clear description of the configuration:
- Nodes, tasks-per-node, threads-per-task, GPUs-per-node.
- Problem size and any relevant parameters.
- Performance metrics:
- Wall-clock runtime.
- Possibly derived metrics (e.g. iterations per second, GFLOP/s if known).
- Scaling type:
- Strong vs weak scaling experiments, as appropriate.
How to structure performance notes
Use a file like docs/performance.md:
- Experiment setup section
Describe:
- What experiments you performed (e.g. strong scaling from 1 to 16 nodes).
- Which scripts correspond to which experiments (scripts/run_strong_1n.slurm, etc.).
- Any environmental assumptions (queue/partition, time limit, node type).
- Tables or concise plots
Example table:
| Nodes | Ranks/Node | Threads/Rank | Problem Size | Time (s) | Speedup vs 1 Node |
|-------|------------|--------------|--------------|----------|-------------------|
| 1 | 4 | 8 | 1024³ | 120 | 1.0 |
| 2 | 4 | 8 | 1024³ | 65 | 1.85 |
| 4 | 4 | 8 | 1024³ | 35 | 3.43 |
- Short interpretation
Just a few bullet points:
- Where scaling works well and where it saturates.
- Any major bottlenecks or unexpected behaviors.
- Impact of thread count, rank count, or GPUs.
Tie this back to the performance concepts from earlier in the course (strong/weak scaling, load balance, communication overhead) without re-explaining them in depth.
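If you want to sanity-check the numbers in such a table, the standard definitions are:

$$S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}$$

For the 2-node row above, $S = 120/65 \approx 1.85$ and $E \approx 0.92$, i.e. about 92% parallel efficiency.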
4. Reproducibility metadata
Reproducibility in HPC is often blocked by missing environment information. Capture at least the following.
System and environment
Document:
- Cluster name or environment (as far as you’re allowed to record).
- OS and kernel version (if easily available).
- Module list for your runs:
```bash
module list > docs/modules_final_runs.txt
```

- Compiler and MPI versions (or CUDA, math libraries, etc.).
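A small capture script run alongside the final experiments makes this painless. A minimal sketch, assuming GCC and OpenMPI as in the earlier examples (the file names under docs/ are only suggestions):

```bash
# Capture environment metadata for the final runs.
module list > docs/modules_final_runs.txt 2>&1   # some module systems print to stderr
gcc --version    | head -n 1 > docs/compiler_version.txt
mpirun --version | head -n 1 > docs/mpi_version.txt
```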
Code version
If using version control:
- Reference the commit hash used for final results.
- Briefly list any important branches or tags (e.g. final-project-submission).
If not using version control, archive a snapshot and mention the archive name, e.g.:
archive/final_project_code_2025-12-10.tar.gz
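Either way, recording the code version can be a one-liner. A minimal sketch, with hypothetical file and directory names:

```bash
# With version control: store the exact commit used for the final results.
git rev-parse HEAD > docs/code_version.txt

# Without version control: archive a dated snapshot of the sources.
mkdir -p archive
tar czf archive/final_project_code_$(date +%F).tar.gz src/ include/ scripts/ docs/
```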
Input data and outputs
To the extent allowed by the project:
- Input data:
- Filenames and directories (e.g. data/input_grid_1024.bin).
- Links or instructions to obtain public datasets, if used.
- Any preprocessing steps (scripts, utilities).
- Outputs:
- Where main results are written (e.g. results/strong_scaling/).
- Key files to inspect (e.g. timings.csv, scaling_plot.png).
- How to regenerate plots from raw logs (e.g. python scripts/make_plots.py).
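For input files in particular, recording a checksum makes it easy to verify that a later rerun starts from the same data. A minimal sketch, using the example input file name from above:

```bash
# Record checksums of the inputs used for the final results.
sha256sum data/input_grid_1024.bin > docs/input_checksums.txt

# Later, verify before a rerun:
sha256sum -c docs/input_checksums.txt
```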
Code-level documentation and organization
Your project is small enough that full-scale API documentation tools are optional, but some structure is essential.
Minimal expectations
- Clear filenames and directory structure
Organize logically, for example:
  - src/ – source code
  - include/ – headers (for C/C++)
  - scripts/ – job scripts and helper scripts
  - docs/ – all documentation
  - results/ – generated results (may be excluded from version control)
- Inline comments for non-obvious logic
  Especially around:
  - Parallel communication patterns.
  - Synchronization points (barriers, locks, reductions).
  - Non-trivial optimizations or workarounds.
Focus comments on why something is done, not only what.
Documenting parallel design
Capture the parallel structure at a high level (1–2 short sections in your docs):
- How the work is divided:
- MPI: domain decomposition type (1D/2D/3D, block, cyclic, etc.).
- OpenMP: main parallel regions and loops.
- GPU: how computation is mapped to threads/blocks, if applicable.
- Key communication/synchronization points:
- Location of collectives (e.g. MPI_Allreduce, MPI_Barrier).
- Critical sections or atomic updates.
This helps reviewers quickly relate the performance results to the implementation choices.
Logging, error handling, and run annotation
Basic logging greatly helps in debugging and performance analysis.
- Command-line and parameters
Print essential parameters at program start:
```
Problem size: 1024 x 1024 x 1024
Ranks: 64, Threads per rank: 8
Time step: 0.001, Steps: 1000
```

- Timing output
  Write concise timing summaries to a file per run:
  - Total time
  - Time in main phases (compute, communication, I/O)
  - Optional: rank 0 prints aggregated statistics
- Error messages
Avoid silent failures; print clear messages indicating what went wrong (bad input size, missing file, failed allocation).
Document in your usage notes:
- Where logs are written.
- Any environment variables or options that control verbosity.
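One way to keep logs consistent is to handle the redirection in the job script itself. A minimal sketch, assuming a SLURM job and a hypothetical MYAPP_VERBOSE variable (substitute whatever your code actually supports):

```bash
# One log file per job, named after the SLURM job ID.
export MYAPP_VERBOSE=1              # placeholder verbosity switch
mkdir -p results/logs
srun ./build/my_app --nx 1024 --ny 1024 --steps 1000 \
    > "results/logs/run_${SLURM_JOB_ID}.log" 2>&1
```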
Project report and README cross-linking
For the final submission:
- Make the README the landing page:
- Brief description.
- Pointers to:
  - docs/usage.md or a “Running the code” section.
  - docs/performance.md.
  - Any additional detailed report (PDF, markdown).
- Avoid duplication:
- High-level summary in the README.
- Details (tables, plots, methodology) in dedicated docs.
This keeps your project easy to navigate and reduces inconsistencies when you update something.
Practical best practices checklist
Use this as a quick self-check when finalizing your project:
- [ ] A top-level README exists and clearly states:
  - [ ] Problem description
  - [ ] Basic capabilities
  - [ ] Requirements
  - [ ] Quick start commands
- [ ] Build instructions:
  - [ ] List required modules / environment
  - [ ] Show exact build commands
  - [ ] Mention key build variants (debug/release, GPU/CPU, etc.)
- [ ] Run instructions:
  - [ ] Include at least one working batch script example
  - [ ] Explain how to change problem size or resources
  - [ ] Indicate expected runtime scale (minutes vs hours)
- [ ] Performance documentation:
  - [ ] Tables or plots for key experiments
  - [ ] Clear mapping from experiments to job scripts
  - [ ] Short interpretation of results
- [ ] Reproducibility:
  - [ ] Module list and versions saved
  - [ ] Code version (commit/tag or archive) recorded
  - [ ] Input and output locations documented
- [ ] Code structure:
  - [ ] Directory layout is simple and logical
  - [ ] Non-obvious parallel logic is commented
  - [ ] Logs and timings are written in a consistent format
If someone can take your repository, follow your documentation, and reproduce your main results on a similar cluster with minimal guesswork, your documentation meets the standard expected for this course and prepares you for real-world HPC projects.