Why documentation matters in HPC
In HPC, clear documentation is not optional:
- You will not remember every build flag or module a year from now.
- Colleagues, support staff, and your future self need to understand and reproduce your work.
- Clusters change: modules are updated, compilers change, queues are reconfigured. Good documentation is the only way to keep your workflow reproducible.
For the final project, treat documentation as part of the deliverable, not an afterthought.
Key goals:
- Make your work re-runnable: another user can run your code and scripts and get the same type of results.
- Make your work understandable: the main design ideas and limitations are clear.
- Make your work maintainable: small changes (new dataset, different node count) are easy.
Core project documentation artifacts
For the final project, you should produce at least these four types of documentation:
- README (top-level overview)
- Run instructions (how to build and execute on the cluster)
- Performance & scaling notes (what you measured and what it means)
- Reproducibility metadata (environment, versions, configurations)
These can be separate files or sections in a single main document, as long as they are clearly organized.
1. README: the entry point
A good README.md (or similar) answers:
- What problem are you solving?
Short description in 2–5 sentences; mention whether it is simulation, data analysis, etc.
- What does the code do?
One paragraph on major features, not every function.
- What is the input and output?
- Input: files, parameters, and typical sizes (e.g. matrix size, grid resolution).
- Output: what is produced (timing logs, plots, data files, etc.).
- What is required to run it?
- Language and main dependencies (e.g. C++17, MPI, OpenMP, CUDA, specific libraries).
- Expected environment (e.g. “tested on clusterX with GCC 12 and OpenMPI 4”).
- Quick start example
Provide a minimal “from zero to run” sequence such as:

```bash
module load gcc/12.2 openmpi/4.1
mkdir build && cd build
cmake ..
make -j
sbatch ../scripts/run_weak_scaling.slurm
```

Keep it concise; details go in more specific files or sections.
2. Run instructions: build and execution
HPC projects live or die on whether they can be rebuilt and rerun.
Build instructions
Specify:
- Modules or environment setup
Document these in an env-setup.sh script or in docs/environment.md:

```bash
module purge
module load gcc/12.2 openmpi/4.1 cmake/3.27
# optional: module list > modules_used.txt
```

- Build system usage
Be explicit about commands and options:

```bash
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=ON ..
make -j8
```

Note any known build variants, e.g.:
- Debug vs release builds
- CUDA vs non-CUDA builds
- MPI-only vs hybrid MPI+OpenMP
You do not need to re-explain compilers or build systems here; just document how your project uses them.
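If your project supports several of these variants, a short block of example configure commands saves readers from digging through the build files. A minimal sketch (USE_OPENMP matches the example above; USE_CUDA is only an illustration, substitute your project's actual option names):

```bash
# Debug build (extra checks, no optimization):
cmake -DCMAKE_BUILD_TYPE=Debug ..

# Release build with hybrid MPI+OpenMP:
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=ON ..

# GPU build, assuming the project defines a USE_CUDA option:
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_CUDA=ON ..
```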
Execution instructions
Clarify:
- How to run on a login node (if allowed for small tests)
For basic debugging or tiny test runs:
```bash
./my_app --nx 128 --ny 128 --steps 10
```

- How to submit batch jobs
Show at least one working example per major experiment type (e.g. strong scaling, weak scaling):
```bash
sbatch scripts/strong_scaling_4nodes.slurm
sbatch scripts/strong_scaling_8nodes.slurm
```

For each provided job script, briefly note:
- What it does (strong vs weak scaling, target runtime, etc.).
- Assumptions (queue/partition, time limit, memory per node, GPU type, etc.).
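If it helps, you can also include a stripped-down job script directly in the docs so readers see these assumptions at a glance. A minimal sketch, assuming a SLURM cluster and the example application from above (the partition name, resource numbers, and program arguments are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=strong_4n
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=00:30:00
#SBATCH --partition=standard        # placeholder: use your cluster's queue

# Strong scaling run: fixed problem size on 4 nodes.
module purge
module load gcc/12.2 openmpi/4.1

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./build/my_app --nx 1024 --ny 1024 --steps 1000
```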
Parameters and configuration
Avoid “magic numbers” in your instructions. Document:
- Important parameters: grid size, iteration counts, solver tolerances, I/O frequency.
- How to change them: command-line options, configuration file, or compile-time #define.
Example:
- Command-line arguments documented in docs/usage.md, with a short synopsis:

```bash
./my_app --nx NX --ny NY --steps STEPS [--output-interval N] [--checkpoints PATH]
```

- Default values and recommended ranges (for the cluster used).
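If you run parameter studies, a tiny driver script in scripts/ also documents how the options are meant to be used. A minimal sketch, reusing the hypothetical my_app options from the synopsis above:

```bash
# Run the same binary at several problem sizes and keep one log per run.
mkdir -p results
for nx in 256 512 1024; do
    ./my_app --nx "${nx}" --ny "${nx}" --steps 100 \
             --output-interval 10 > "results/size_${nx}.log"
done
```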
3. Performance and scaling documentation
Your final project includes performance analysis; here you document the what, how, and summary conclusions, not every raw log line.
What to record
For each experiment (e.g. node counts, problem sizes, GPU vs CPU):
- Clear description of the configuration:
- Nodes, tasks-per-node, threads-per-task, GPUs-per-node.
- Problem size and any relevant parameters.
- Performance metrics:
- Wall-clock runtime.
- Possibly derived metrics (e.g. iterations per second, GFLOP/s if known).
- Scaling type:
- Strong vs weak scaling experiments, as appropriate.
How to structure performance notes
Use a file like docs/performance.md:
- Experiment setup section
Describe:
- What experiments you performed (e.g. strong scaling from 1 to 16 nodes).
- Which scripts correspond to which experiments (scripts/run_strong_1n.slurm, etc.).
- Any environmental assumptions (queue/partition, time limit, node type).
- Tables or concise plots
Example table:
| Nodes | Ranks/Node | Threads/Rank | Problem Size | Time (s) | Speedup vs 1 Node |
|-------|------------|--------------|--------------|----------|-------------------|
| 1 | 4 | 8 | 1024³ | 120 | 1.0 |
| 2 | 4 | 8 | 1024³ | 65 | 1.85 |
| 4 | 4 | 8 | 1024³ | 35 | 3.43 |
- Short interpretation
Just a few bullet points:
- Where scaling works well and where it saturates.
- Any major bottlenecks or unexpected behaviors.
- Impact of thread count, rank count, or GPUs.
Tie this back to the performance concepts from earlier in the course (strong/weak scaling, load balance, communication overhead) without re-explaining them in depth.
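If you want to sanity-check the numbers in such a table, the standard definitions are:

$$S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}$$

For the 2-node row above, $S = 120/65 \approx 1.85$ and $E \approx 0.92$, i.e. about 92% parallel efficiency.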
4. Reproducibility metadata
Reproducibility in HPC is often blocked by missing environment information. Capture at least the following.
System and environment
Document:
- Cluster name or environment (as far as you’re allowed to record).
- OS and kernel version (if easily available).
- Module list for your runs:
```bash
module list > docs/modules_final_runs.txt
```

- Compiler and MPI versions (or CUDA, math libraries, etc.).
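A small capture script run alongside the final experiments makes this painless. A minimal sketch, assuming GCC and OpenMPI as in the earlier examples (the file names under docs/ are only suggestions):

```bash
# Capture environment metadata for the final runs.
module list > docs/modules_final_runs.txt 2>&1   # some module systems print to stderr
gcc --version    | head -n 1 > docs/compiler_version.txt
mpirun --version | head -n 1 > docs/mpi_version.txt
```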
Code version
If using version control:
- Reference the commit hash used for final results.
- Briefly list any important branches or tags (e.g. final-project-submission).
If not using version control, archive a snapshot and mention the archive name, e.g.:
archive/final_project_code_2025-12-10.tar.gz
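Either way, recording the code version can be a one-liner. A minimal sketch, with hypothetical file and directory names:

```bash
# With version control: store the exact commit used for the final results.
git rev-parse HEAD > docs/code_version.txt

# Without version control: archive a dated snapshot of the sources.
mkdir -p archive
tar czf archive/final_project_code_$(date +%F).tar.gz src/ include/ scripts/ docs/
```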
Input data and outputs
To the extent allowed by the project:
- Input data:
- Filenames and directories (e.g. data/input_grid_1024.bin).
- Links or instructions to obtain public datasets, if used.
- Any preprocessing steps (scripts, utilities).
- Outputs:
- Where main results are written (e.g. results/strong_scaling/).
- Key files to inspect (e.g. timings.csv, scaling_plot.png).
- How to regenerate plots from raw logs (e.g. python scripts/make_plots.py).
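For input files in particular, recording a checksum makes it easy to verify that a later rerun starts from the same data. A minimal sketch, using the example input file name from above:

```bash
# Record checksums of the inputs used for the final results.
sha256sum data/input_grid_1024.bin > docs/input_checksums.txt

# Later, verify before a rerun:
sha256sum -c docs/input_checksums.txt
```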
Code-level documentation and organization
Your project is small enough that full-scale API documentation tools are optional, but some structure is essential.
Minimal expectations
- Clear filenames and directory structure
Organize logically, for example:
  - src/ – source code
  - include/ – headers (for C/C++)
  - scripts/ – job scripts and helper scripts
  - docs/ – all documentation
  - results/ – generated results (may be excluded from version control)
- Inline comments for non-obvious logic
  Especially around:
  - Parallel communication patterns.
  - Synchronization points (barriers, locks, reductions).
  - Non-trivial optimizations or workarounds.
Focus comments on why something is done, not only what.
Documenting parallel design
Capture the parallel structure at a high level (1–2 short sections in your docs):
- How the work is divided:
- MPI: domain decomposition type (1D/2D/3D, block, cyclic, etc.).
- OpenMP: main parallel regions and loops.
- GPU: how computation is mapped to threads/blocks, if applicable.
- Key communication/synchronization points:
- Location of collectives (e.g. MPI_Allreduce, MPI_Barrier).
- Critical sections or atomic updates.
This helps reviewers quickly relate the performance results to the implementation choices.
Logging, error handling, and run annotation
Basic logging greatly helps in debugging and performance analysis.
- Command-line and parameters
Print essential parameters at program start:
```
Problem size: 1024 x 1024 x 1024
Ranks: 64, Threads per rank: 8
Time step: 0.001, Steps: 1000
```

- Timing output
  Write concise timing summaries to a file per run:
  - Total time
  - Time in main phases (compute, communication, I/O)
  - Optional: rank 0 prints aggregated statistics
- Error messages
Avoid silent failures; print clear messages indicating what went wrong (bad input size, missing file, failed allocation).
Document in your usage notes:
- Where logs are written.
- Any environment variables or options that control verbosity.
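One way to keep logs consistent is to handle the redirection in the job script itself. A minimal sketch, assuming a SLURM job and a hypothetical MYAPP_VERBOSE variable (substitute whatever your code actually supports):

```bash
# One log file per job, named after the SLURM job ID.
export MYAPP_VERBOSE=1              # placeholder verbosity switch
mkdir -p results/logs
srun ./build/my_app --nx 1024 --ny 1024 --steps 1000 \
    > "results/logs/run_${SLURM_JOB_ID}.log" 2>&1
```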
Project report and README cross-linking
For the final submission:
- Make the README the landing page:
- Brief description.
- Pointers to:
  - docs/usage.md or a “Running the code” section.
  - docs/performance.md.
  - Any additional detailed report (PDF, markdown).
- Avoid duplication:
- High-level summary in the README.
- Details (tables, plots, methodology) in dedicated docs.
This keeps your project easy to navigate and reduces inconsistencies when you update something.
Practical best practices checklist
Use this as a quick self-check when finalizing your project:
- [ ] A top-level README exists and clearly states:
  - [ ] Problem description
  - [ ] Basic capabilities
  - [ ] Requirements
  - [ ] Quick start commands
- [ ] Build instructions:
  - [ ] List required modules / environment
  - [ ] Show exact build commands
  - [ ] Mention key build variants (debug/release, GPU/CPU, etc.)
- [ ] Run instructions:
  - [ ] Include at least one working batch script example
  - [ ] Explain how to change problem size or resources
  - [ ] Indicate expected runtime scale (minutes vs hours)
- [ ] Performance documentation:
  - [ ] Tables or plots for key experiments
  - [ ] Clear mapping from experiments to job scripts
  - [ ] Short interpretation of results
- [ ] Reproducibility:
  - [ ] Module list and versions saved
  - [ ] Code version (commit/tag or archive) recorded
  - [ ] Input and output locations documented
- [ ] Code structure:
  - [ ] Directory layout is simple and logical
  - [ ] Non-obvious parallel logic is commented
  - [ ] Logs and timings are written in a consistent format
If someone can take your repository, follow your documentation, and reproduce your main results on a similar cluster with minimal guesswork, your documentation meets the standard expected for this course and prepares you for real-world HPC projects.