Documentation and best practices summary

Why documentation matters in HPC

In HPC, clear documentation is not optional:

For the final project, treat documentation as part of the deliverable, not an afterthought.

Key goals:

Core project documentation artifacts

For the final project, you should produce at least these four types of documentation:

  1. README (top-level overview)
  2. Run instructions (how to build and execute on the cluster)
  3. Performance & scaling notes (what you measured and what it means)
  4. Reproducibility metadata (environment, versions, configurations)

These can be separate files or sections in a single main document, as long as they are clearly organized.

1. README: the entry point

A good README.md (or similar) answers:

  module load gcc/12.2 openmpi/4.1
  mkdir build && cd build
  cmake ..
  make -j8
  sbatch ../scripts/run_weak_scaling.slurm

Keep it concise; details go in more specific files or sections.

2. Run instructions: build and execution

HPC projects live or die on whether they can be rebuilt and rerun.

Build instructions

Specify:

  module purge
  module load gcc/12.2 openmpi/4.1 cmake/3.27
  # optional: module list > modules_used.txt 2>&1   (module list often prints to stderr)
  mkdir -p build && cd build
  cmake -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=ON ..
  make -j8

Note any known build variants, e.g.:

You do not need to re-explain compilers or build systems here; just document how your project uses them.

Execution instructions

Clarify:

  ./my_app --nx 128 --ny 128 --steps 10
  sbatch scripts/strong_scaling_4nodes.slurm
  sbatch scripts/strong_scaling_8nodes.slurm

For each provided job script, briefly note:
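
Such notes can live in the job script itself, as header comments next to the resource requests. The following is a sketch only: the job name, the time limit, and the comment fields are assumptions, while the rank and thread counts mirror the example table later in this section. The launch is guarded so the script can also be dry-run outside a Slurm allocation.

```shell
#!/bin/bash
#SBATCH --job-name=strong_4nodes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=00:30:00
#SBATCH --output=%x_%j.out
#
# Purpose:  strong scaling run, fixed problem size, 4 nodes
# Layout:   4 MPI ranks per node, 8 OpenMP threads per rank
# Expected: note the typical runtime and output location here
# (job name and time limit above are assumed values for this sketch)

# Match the OpenMP thread count to the cores Slurm assigned per rank.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-8}"

# Launch only when srun and the binary are available; otherwise show
# the command that would have been run.
if command -v srun >/dev/null 2>&1 && [ -x ./my_app ]; then
  srun ./my_app --nx 128 --ny 128 --steps 10
else
  echo "Dry run: srun ./my_app --nx 128 --ny 128 --steps 10"
fi
```

Keeping purpose and layout in the header means the documentation travels with the script when it is copied or modified.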

Parameters and configuration

Avoid “magic numbers” in your instructions. Document:

Example:

  ./my_app --nx NX --ny NY --steps STEPS [--output-interval N] [--checkpoints PATH]
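
One way to keep magic numbers out of a launch script is to give each parameter a named, commented variable with an overridable default. This is a sketch only: the defaults are the values from the earlier example run, the parameter meanings in the comments are assumptions, and `build_cmd` is a hypothetical helper that assembles the command line without running it.

```shell
#!/bin/bash
# Named parameters with documented defaults; override via the environment,
# e.g. NX=256 NY=256 ./run.sh   (run.sh is a placeholder name).
NX="${NX:-128}"        # assumed meaning: grid size in x
NY="${NY:-128}"        # assumed meaning: grid size in y
STEPS="${STEPS:-10}"   # assumed meaning: number of time steps

# Assemble the command line from the named parameters.
build_cmd() {
  echo "./my_app --nx $NX --ny $NY --steps $STEPS"
}

# Print the exact command so it ends up in the job log.
echo "Command: $(build_cmd)"
```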

3. Performance and scaling documentation

Your final project includes performance analysis; here you document what you measured, how you measured it, and your summary conclusions, not every raw log line.

What to record

For each experiment (e.g. node counts, problem sizes, GPU vs CPU):

How to structure performance notes

Use a file like docs/performance.md:

Describe:

Example table:

| Nodes | Ranks/Node | Threads/Rank | Problem Size | Time (s) | Speedup vs 1 Node |
|-------|------------|--------------|--------------|----------|-------------------|
| 1 | 4 | 8 | 1024³ | 120 | 1.0 |
| 2 | 4 | 8 | 1024³ | 65 | 1.85 |
| 4 | 4 | 8 | 1024³ | 35 | 3.43 |
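
The speedup column in a table like this can be derived directly from the measured times rather than computed by hand. A minimal sketch, assuming the timings are kept as `nodes time` pairs with the baseline run first and using the example numbers above; the parallel efficiency column (speedup divided by the increase in node count) is an addition that is not part of the example table.

```shell
#!/bin/bash
# Derive speedup and parallel efficiency from measured wall-clock times.
# Input format (an assumption for this sketch): "<nodes> <time_in_seconds>"
# per line, with the baseline run on the first line.
scaling_table() {
  awk 'NR == 1 { base_nodes = $1; base_time = $2 }
       {
         speedup = base_time / $2
         efficiency = speedup / ($1 / base_nodes)
         printf "%d %.2f %.2f\n", $1, speedup, efficiency
       }'
}

# Example values taken from the table: prints nodes, speedup, efficiency.
printf '1 120\n2 65\n4 35\n' | scaling_table
```

For the example data this gives speedups of 1.85 and 3.43, matching the table, with efficiencies of 0.92 and 0.86.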

Just a few bullet points:

Tie this back to the performance concepts from earlier in the course (strong/weak scaling, load balance, communication overhead) without re-explaining them in depth.

4. Reproducibility metadata

Reproducibility in HPC is often blocked by missing environment information. Capture:

System and environment

Document:

  # 2>&1 because many module systems print the list to stderr
  module list > docs/modules_final_runs.txt 2>&1
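
Beyond the module list, a few more commands can capture most of the system context in one file. A rough sketch, assuming a Linux cluster; the function name `capture_env` and the output path are placeholders, and the `gcc` and `module` calls are skipped when those commands are not available.

```shell
#!/bin/bash
# Write basic system and environment information to a file.
# capture_env and the default output path are hypothetical names.
capture_env() {
  local out="${1:-docs/environment_final_runs.txt}"
  mkdir -p "$(dirname "$out")"
  {
    echo "date:     $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "hostname: $(uname -n)"
    echo "kernel:   $(uname -sr)"
    # Compiler version, if a compiler is on PATH.
    if command -v gcc >/dev/null 2>&1; then
      echo "compiler: $(gcc --version | head -n 1)"
    fi
    # Loaded modules, if a module system is present in this shell.
    # The list is often printed to stderr, hence 2>&1.
    if command -v module >/dev/null 2>&1; then
      echo "modules:"
      module list 2>&1
    fi
  } > "$out"
}

capture_env docs/environment_final_runs.txt
```

Running this from the same shell or job script that launches the final runs keeps the recorded environment consistent with what was actually used.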

Code version

If using version control:
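
With Git, the exact commit and whether the working tree had uncommitted changes can be recorded next to the results. A minimal sketch; the function name `record_code_version` and the output file are placeholders, not part of any required layout.

```shell
#!/bin/bash
# Record the commit hash and working-tree state used for a set of runs.
# record_code_version and its default output path are hypothetical names.
record_code_version() {
  local out="${1:-docs/code_version.txt}"
  mkdir -p "$(dirname "$out")"
  if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    {
      echo "commit: $(git rev-parse --verify --quiet HEAD || echo 'no commits yet')"
      # A non-empty porcelain status means there are uncommitted changes.
      if [ -n "$(git status --porcelain)" ]; then
        echo "state:  dirty (uncommitted changes present)"
      else
        echo "state:  clean"
      fi
    } > "$out"
  else
    echo "commit: unknown (not a git repository)" > "$out"
  fi
}

record_code_version docs/code_version.txt
```

A "dirty" state is worth noting in the report, since the recorded commit alone would then not reproduce the code that was run.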

If not using version control, archive a snapshot and mention the archive name, e.g.:

Input data and outputs

To the extent allowed by the project:

Code-level documentation and organization

Your project is small enough that full-scale API documentation tools are optional, but some structure is essential.

Minimal expectations

Focus comments on why something is done, not only what.

Documenting parallel design

Capture the parallel structure at a high level (1–2 short sections in your docs):

This helps reviewers quickly relate the performance results to the implementation choices.

Logging, error handling, and run annotation

Basic logging greatly helps in debugging and performance analysis.

  Problem size: 1024 x 1024 x 1024
  Ranks: 64, Threads per rank: 8
  Time step: 0.001, Steps: 1000
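
A header like the one above can also be written by the launch script, so that every log starts with the parameters of the run. A small sketch; `print_run_header` and its argument order are made up for illustration, and the values are those from the sample output.

```shell
#!/bin/bash
# Print a run header so each log records the parameters it was produced with.
# print_run_header is a hypothetical helper; its arguments are
# nx ny nz ranks threads_per_rank time_step steps.
print_run_header() {
  printf 'Problem size: %s x %s x %s\n' "$1" "$2" "$3"
  printf 'Ranks: %s, Threads per rank: %s\n' "$4" "$5"
  printf 'Time step: %s, Steps: %s\n' "$6" "$7"
}

print_run_header 1024 1024 1024 64 8 0.001 1000
```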

Document in your usage notes:

Project report and README cross-linking

For the final submission:

This keeps your project easy to navigate and reduces inconsistencies when you update something.

Practical best practices checklist

Use this as a quick self-check when finalizing your project:

If someone can take your repository, follow your documentation, and reproduce your main results on a similar cluster with minimal guesswork, your documentation meets the standard expected for this course and prepares you for real-world HPC projects.
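
A quick automated pass can catch missing files before submission. This is a sketch, assuming the file names used as examples in this section; `check_docs` is a hypothetical helper, and the list should be adjusted to your own layout.

```shell
#!/bin/bash
# Report which expected documentation files are missing or empty.
# The paths passed in are example names, not a required layout.
check_docs() {
  local missing=0
  local f
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "ok:      $f"
    else
      echo "MISSING: $f"
      missing=$((missing + 1))
    fi
  done
  # The exit status is the number of missing files (0 means all present).
  return "$missing"
}

check_docs README.md docs/performance.md docs/modules_final_runs.txt || true
```

A check like this only confirms that the files exist; whether their contents are sufficient still needs a human read-through.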