Overview of the Final Project
The final part of this course is intentionally practical. Instead of just answering quiz questions, you will design, run, and analyze an HPC workload end‑to‑end, using the concepts you have learned.
This chapter describes:
- What the final project looks like
- Example project ideas
- Expected deliverables
- Evaluation criteria
- How to structure your work
- Suggested hands‑on exercises leading up to the project
The technical “how” (e.g., how to submit a SLURM job, how to use MPI, how to profile) is covered in earlier chapters. Here the focus is on turning those pieces into a coherent project and practice workflow.
Goals of the Final Project
By the end of the project you should be able to:
- Design an HPC computation that benefits from parallelism
- Choose appropriate resources (cores, nodes, memory, GPUs, wall clock time)
- Run and manage jobs on an HPC cluster using a batch system
- Collect performance data (runtime, scaling, resource usage)
- Interpret results and identify bottlenecks
- Document your setup so that someone else can reproduce your work
The project is intentionally open‑ended: you can work with provided examples or bring a small problem from your own domain, as long as it fits within the course constraints.
Project Constraints and Scope
To keep the project manageable for beginners, we impose some limits. A typical project should:
- Run comfortably within the available cluster allocation (e.g., up to a few dozen CPU cores and possibly a single GPU)
- Keep individual runs to roughly 10–30 minutes of wall‑clock time
- Use at least one form of parallelism (MPI, OpenMP, GPU, or a hybrid)
- Include some basic performance analysis (timings, scaling, or profiling)
- Be fully runnable by someone else with access to the same cluster or a similar system
You do not need to:
- Build a production‑grade, fully optimized code
- Achieve perfect scaling
- Use all programming models or all hardware types
Depth and clarity matter more than size or complexity.
Types of Acceptable Projects
You can choose one of several broad project types. Discuss with your instructor which option fits your background and time.
1. Parallelization of an Existing Serial Code
- Start from a provided serial reference implementation (e.g., a numerical kernel, simple simulation, or data processing script).
- Introduce parallelism using OpenMP, MPI, or a GPU approach.
- Compare performance and scaling of the serial and parallel versions.
Typical steps:
- Understand the algorithm and identify the most time‑consuming part.
- Select a parallelization strategy (e.g., loop parallelism, domain decomposition).
- Implement a minimal but correct parallel version.
- Run scaling experiments (vary thread count, processes, or nodes).
- Analyze speedup and bottlenecks.
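As a concrete illustration of these steps, here is a minimal sketch in Python: a serial reference sum over a numerical kernel, and a parallel version that decomposes the index range into chunks and distributes them with `multiprocessing`. The kernel `f` and the chunking scheme are illustrative choices, not part of any provided course code.

```python
import math
from multiprocessing import Pool

def f(x):
    """Example per-element work: a cheap numerical kernel."""
    return math.sqrt(x) * math.sin(x)

def serial_sum(n):
    """Serial reference implementation (keep this for comparison)."""
    return sum(f(i) for i in range(n))

def chunk_sum(bounds):
    """Work on one chunk [lo, hi) -- the unit of parallel work."""
    lo, hi = bounds
    return sum(f(i) for i in range(lo, hi))

def parallel_sum(n, workers=4):
    """Split the index range into one chunk per worker (a 1D domain decomposition)."""
    step = (n + workers - 1) // workers
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with Pool(workers) as pool:
        return sum(pool.map(chunk_sum, chunks))

if __name__ == "__main__":
    n = 100_000
    s, p = serial_sum(n), parallel_sum(n)
    # Validate the parallel version against the serial reference.
    print(abs(s - p) < 1e-6)
```

Timing both versions at different worker counts then gives the data for the scaling experiments in the later steps.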
2. Scaling Study of a Preexisting Parallel Application
- Use a preinstalled HPC application (e.g., a CFD, MD, or FEM package, or a machine learning framework).
- Do not modify its source code; instead focus on how it behaves on the cluster.
- Perform strong and/or weak scaling tests with a realistic problem size.
Typical steps:
- Learn how to run the application via job scripts.
- Choose a test case and define what you will measure.
- Run the same case with different core counts or nodes.
- Analyze how performance changes and why (I/O, communication, load imbalance, etc.).
3. End‑to‑End Workflow Project
- Focus on a full workflow rather than heavy coding.
- Combine: data preparation → main computation → postprocessing and visualization.
- Emphasize automation, job orchestration, and reproducibility.
Typical steps:
- Write scripts for each stage (e.g., data generation, simulation, analysis).
- Integrate them with the batch system (job dependencies, arrays).
- Log configuration, environment modules, and parameters.
- Measure throughput and resource utilization for the whole workflow.
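To make the orchestration step concrete, the sketch below constructs a chain of `sbatch` commands in which each stage waits for the previous one via SLURM's `--dependency=afterok:` flag. The script names are hypothetical, and in a real workflow you would run each command, parse the job ID from `sbatch`'s output, and substitute it for the `<prev_jobid>` placeholder.

```python
def dependency_chain(scripts):
    """Build sbatch commands where each stage waits for the previous one.

    The <prev_jobid> placeholder marks where the job ID parsed from the
    previous sbatch invocation would be inserted.
    """
    cmds = []
    for i, script in enumerate(scripts):
        if i == 0:
            cmds.append(f"sbatch {script}")
        else:
            cmds.append(f"sbatch --dependency=afterok:<prev_jobid> {script}")
    return cmds

# Hypothetical three-stage workflow: prepare -> simulate -> postprocess
for cmd in dependency_chain(["prepare.sh", "simulate.sh", "postprocess.sh"]):
    print(cmd)
```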
4. Mini Application from Your Domain
- Propose a small but complete computation relevant to your research or interests.
- Implementation can be in a language of your choice, provided it can run on the cluster.
- Must be constrained to fit the course timeline and resource limits.
Typical steps:
- Define a clear, limited problem (not your entire research project).
- Identify where parallelism is natural (e.g., independent tasks, data chunks).
- Implement a prototype and demonstrate at least one scaling experiment.
- Relate the results back to the characteristics of your problem.
Core Deliverables
Every project must produce four main artifacts:
- Code and job scripts
- Source code files
- Job submission scripts for the batch scheduler
- Any auxiliary scripts (data generation, plotting, postprocessing)
- Configuration and environment description
- Which modules or software stacks you used
- Compiler and key compilation flags
- Hardware and job parameters (nodes, tasks, threads, memory, GPUs)
- Performance and scaling results
- Raw measurements (execution times, iterations per second, throughput, etc.)
- At least one set of strong or weak scaling results
- A short interpretation of the results
- Written report
- See the next section for a suggested structure
You may also be asked for:
- A brief presentation (slides or live demo)
- A public or internal repository with your code and documentation
Suggested Report Structure
Keep the report concise but complete; 5–10 pages is typically sufficient if it is well organized.
1. Introduction
- Brief description of the problem you are addressing
- Why it is (or could be) relevant to HPC
- Which type of project you chose (parallelization, scaling study, etc.)
2. Methods and Implementation
- Programming language(s) and libraries used
- Parallelization approach (e.g., MPI domain decomposition, OpenMP loop parallelism, GPU kernels)
- High‑level description of the algorithm and data decomposition (do not repeat full textbooks)
- Key design choices and trade‑offs
Focus on what is specific to your implementation, not on re‑explaining basic parallel computing theory.
3. Experimental Setup
- Description of the HPC system used
- CPU type, number of cores per node
- Presence of GPUs or special accelerators (if relevant)
- Interconnect (only as needed to interpret results)
- Software environment
- Modules loaded or container image used
- Compiler, version, and major optimization flags
- Job parameters
- Node count, tasks per node, threads per task
- Problem sizes
- Wall‑time limits and any resource constraints
4. Results
Present numerical measurements rather than only qualitative statements.
Examples of what to include:
- Execution times for different core counts or nodes
- Speedup and efficiency:
$$
\text{Speedup}(p) = \frac{T_1}{T_p}, \quad
\text{Efficiency}(p) = \frac{\text{Speedup}(p)}{p}
$$
where $T_1$ is the runtime with one processing unit and $T_p$ the runtime with $p$ units.
- Throughput metrics appropriate to your problem (e.g., time steps per second, matrix solves per second, samples per second)
- Resource utilization highlights (e.g., memory usage, GPU occupancy if available from tools)
Use tables and basic plots (e.g., runtime vs. cores, speedup vs. cores) to make trends clear.
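The definitions above translate directly into code. The following sketch computes speedup and efficiency from a table of measured runtimes; the timing values are made up for illustration only.

```python
def speedup(t1, tp):
    """Speedup(p) = T1 / Tp."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Efficiency(p) = Speedup(p) / p."""
    return speedup(t1, tp) / p

# Hypothetical measured runtimes (seconds) for p = 1, 2, 4, 8 processing units
timings = {1: 120.0, 2: 63.0, 4: 34.0, 8: 21.0}
t1 = timings[1]
print(f"{'p':>3} {'T_p':>8} {'S(p)':>6} {'E(p)':>6}")
for p, tp in timings.items():
    print(f"{p:>3} {tp:>8.1f} {speedup(t1, tp):>6.2f} {efficiency(t1, tp, p):>6.2f}")
```

A table like this, together with the corresponding runtime and speedup plots, is usually all the Results section needs.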
5. Discussion
Interpret the results in light of what you learned in the course:
- Did your code scale as expected? Where does it start to saturate?
- Which factors limit performance (communication, I/O, memory bandwidth, load imbalance, algorithmic issues)?
- How do strong vs. weak scaling behaviors differ, if you tested both?
- How do different settings (e.g., threading vs. pure MPI, different problem sizes) affect performance?
You do not need extremely detailed performance modeling, but you should connect observations to plausible causes.
6. Limitations and Future Work
- What you did not implement due to time or resource constraints
- Obvious next steps (e.g., better load balancing, more advanced algorithms, GPU porting)
- How the project could be scaled up on larger systems or with more time
7. Reproducibility Notes
- Exact commands to build and run your code
- Any random seeds or configuration files
- Where input data and expected outputs are stored
- How another user could repeat at least one of your main experiments
Evaluation Criteria
While exact grading rubrics vary, projects are commonly evaluated on:
- Correctness and robustness
- The code runs and produces reasonable, consistent results
- The parallel version is logically correct (no obvious race conditions, deadlocks, or incorrect outputs for tested cases)
- Appropriate use of HPC resources
- Jobs request realistic resources (no massive over‑allocation for tiny tasks)
- The chosen form of parallelism is sensible for the problem
- Basic job management practices are followed (batch submission rather than heavy work on interactive or login nodes)
- Performance investigation
- You collected meaningful data (timings, scaling) rather than single anecdotal runs
- You attempted at least basic performance improvements
- You can explain performance trends in a reasoned way
- Quality of documentation
- Clear, organized report
- Enough detail to understand and reproduce the work
- Transparent about limitations and problems encountered
- Professional practices
- Clean structure of code and scripts
- Version control usage when possible
- Attention to reproducibility and environment management
Creativity and ambition are valued, but a smaller, well‑executed project is preferable to a grand plan that never fully works.
Recommended Workflow for the Project
To prevent last‑minute surprises, follow a staged approach.
Stage 1: Define the Problem and Plan
- Choose your project type and specific topic.
- Write a short proposal (1–2 pages):
- Aim of the project
- Planned methods and tools
- Expected outputs and performance tests
- Verify feasibility with your instructor or mentor.
Stage 2: Get a Correct Baseline
- Implement or obtain a serial or reference version.
- Validate correctness on small inputs:
- Check against known results or invariants.
- Add simple tests you can run quickly.
- Create the initial job script and verify the code runs on the cluster.
Focus on correctness before performance.
Stage 3: Introduce Parallelism or Scaling
- Add the chosen parallel mechanism(s) to your code, or set up scaling runs for existing software.
- Start with very small configurations (few cores, small data) to debug parallel logic.
- Once it works, gradually increase resource counts and problem size.
Make sure to keep the serial or single‑process version intact for comparison.
Stage 4: Systematic Experiments
Plan a small but structured experiment matrix, for example:
- Vary number of cores: $1, 2, 4, 8, 16$ (or as available)
- Optionally vary problem size (for weak scaling)
- For hybrid or GPU codes, vary relevant parameters (threads per task, number of GPUs)
For each configuration:
- Record all parameters (nodes, tasks, threads, input size, environment modules)
- Run multiple trials if runtime is highly variable and average the results
- Save logs and job output files systematically
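One lightweight way to organize such a matrix is to enumerate every configuration up front and write a CSV template that each job fills in afterwards. This is a sketch under assumptions (the file name `experiments.csv` and the column layout are arbitrary choices):

```python
import csv
import itertools

core_counts = [1, 2, 4, 8, 16]
problem_sizes = ["small", "large"]

# Cartesian product: one row per configuration to run and record.
matrix = list(itertools.product(core_counts, problem_sizes))

with open("experiments.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["cores", "problem_size", "runtime_s", "notes"])
    for cores, size in matrix:
        writer.writerow([cores, size, "", ""])  # runtime filled in after each job

print(len(matrix))  # 10 configurations
```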
Stage 5: Analyze and Visualize
- Compute speedups, efficiencies, or other metrics.
- Create basic plots:
- Runtime vs. cores
- Speedup vs. cores
- Efficiency vs. cores
- Identify where scaling degrades and hypothesize reasons.
Use profiling or logging information as needed to support your explanations.
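Locating where scaling degrades can itself be automated. The sketch below scans a runtime table for the first core count whose parallel efficiency drops below a chosen threshold; the 0.5 cutoff and the timing values are illustrative assumptions, not a course requirement.

```python
def saturation_point(timings, threshold=0.5):
    """Return the smallest core count whose parallel efficiency falls
    below `threshold`, or None if scaling holds throughout.

    `timings` maps core count p -> measured runtime T_p and must
    include p = 1 as the baseline.
    """
    t1 = timings[1]
    for p in sorted(timings):
        eff = (t1 / timings[p]) / p  # Efficiency(p) = Speedup(p) / p
        if eff < threshold:
            return p
    return None

# Hypothetical strong-scaling timings: reasonable up to 8 cores, poor at 16.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0, 16: 14.0}
print(saturation_point(timings))  # -> 16
```

Once you know where efficiency collapses, profiling that specific configuration is usually the fastest route to an explanation.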
Stage 6: Finalize Report and Packaging
- Clean up and comment your code.
- Make sure job scripts are generic and not tied to temporary directories or personal paths.
- Verify that your repository or project directory includes:
- Source code
- Job scripts
- Instructions (README)
- Key configuration files
- Plots and data used for the report
Write and proofread the report, checking that all figures are readable and labeled, and that another reader could follow your workflow.
Progressive Hands‑On Exercises
Before or alongside the final project, you are encouraged (or may be required) to complete smaller hands‑on exercises. These are designed to practice individual skills you will need for the project.
Below is a suggested progression; details such as exact commands or little code snippets are covered in earlier chapters.
Exercise 1: Basic Job Submission
- Write a minimal batch script that:
- Requests a small number of cores for a short time
- Runs a simple CPU‑bound program (e.g., computing $\pi$ by numerical integration)
- Submit the job, check its status, and inspect the output files.
- Record what happens if you request too many resources or too little wall time.
Goal: Become comfortable with the batch system and job lifecycle.
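As a concrete candidate for the CPU‑bound program, here is one standard way to compute $\pi$ by numerical integration: midpoint‑rule integration of $4/(1+x^2)$ on $[0,1]$, whose exact value is $\pi$. This is a minimal serial sketch; your batch script would simply invoke it.

```python
def estimate_pi(n):
    """Midpoint-rule integration of 4/(1+x^2) on [0, 1], which equals pi."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h          # midpoint of subinterval i
        total += 4.0 / (1.0 + x * x)
    return total * h

print(estimate_pi(1_000_000))  # close to 3.141592653589793
```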
Exercise 2: Thread‑Level Parallelism
- Take a simple loop‑based computation.
- Parallelize it using OpenMP (or a similar threading API).
- Run with different numbers of threads.
- Measure execution times and produce a small speedup table.
Goal: Practice thread control, environment variables, and basic performance measurement.
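If your course uses C with OpenMP for this exercise, earlier chapters cover the mechanics. For illustration only, the sketch below shows the measurement side in Python; since CPython threads do not speed up CPU‑bound loops, it uses worker processes as a stand‑in for threads. The work function and task counts are arbitrary choices.

```python
import time
from concurrent.futures import ProcessPoolExecutor

def busy(n):
    """CPU-bound work unit."""
    s = 0.0
    for i in range(n):
        s += (i % 7) * 0.5
    return s

def timed_run(workers, tasks=8, n=200_000):
    """Run `tasks` independent work units on `workers` processes; return seconds."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as ex:
        list(ex.map(busy, [n] * tasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    t1 = timed_run(1)
    for w in (1, 2, 4):
        tw = timed_run(w)
        print(f"workers={w}  time={tw:.3f}s  speedup={t1 / tw:.2f}")
```

The printed lines form exactly the kind of small speedup table the exercise asks for.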
Exercise 3: Process‑Level Parallelism
- Use a small MPI‑based example program.
- Run it with different numbers of processes and, optionally, across multiple nodes.
- Observe how execution time changes.
- Experiment with different process layouts if the batch system allows it.
Goal: Understand multi‑process execution and mapping onto cluster resources.
Exercise 4: Simple Scaling Study
- Select a small parallel application (provided or from an earlier exercise).
- Conduct a mini strong scaling test:
- Fix the problem size.
- Double the resources stepwise (e.g., 1, 2, 4, 8 processes).
- Plot runtime or speedup as a function of process count.
Goal: Bridge between “it runs” and “how well does it scale,” in preparation for the project.

Exercise 5: Workflow and Reproducibility
- Chain two or three steps:
- Data generation
- Main computation
- Postprocessing (e.g., statistics or plots)
- Automate the workflow with job dependencies or a simple script.
- Record the exact environment (modules, versions) and store configuration in a text file.
Goal: Practice constructing small but repeatable HPC workflows.
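Recording the environment can be scripted rather than done by hand. The sketch below snapshots a few reproducibility‑relevant facts to a JSON file; the file name is an arbitrary choice, and `LOADEDMODULES` is the variable typically set by the `module` system, which may be absent on your machine.

```python
import json
import os
import platform
import sys

def environment_record():
    """Collect a minimal, reproducibility-oriented snapshot of the environment."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "hostname": platform.node(),
        # Set by Environment Modules / Lmod when modules are loaded, if present.
        "loaded_modules": os.environ.get("LOADEDMODULES", ""),
    }

with open("environment.json", "w") as fh:
    json.dump(environment_record(), fh, indent=2)
print(sorted(environment_record().keys()))
```

Committing a file like this alongside your job scripts makes the "exact environment" requirement of the report nearly free.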
Collaboration and Academic Integrity
Collaboration policies vary by course; follow your instructor’s rules. Common guidelines include:
- You may discuss general ideas and debugging strategies with classmates.
- Each student (or approved project pair/group) must:
- Write their own code (except for provided templates) and job scripts.
- Produce their own report and analysis.
- If you reuse any external code, libraries, or scripts:
- Cite them clearly in your report.
- Distinguish between your own contributions and third‑party components.
Transparency is crucial: clearly stating what is yours, what is adapted, and what is used as‑is is part of professional HPC practice.
Practical Tips for a Successful Project
- Start early; cluster queues and debugging can introduce delays.
- Keep runs small and fast while prototyping; scale up only when things are stable.
- Save intermediate results and logs systematically; you will need them for the report.
- Expect some failures (e.g., jobs killed by limits, numerical crashes) and budget time to investigate them.
- Ask for help when you are stuck on infrastructure or environment issues; do not lose days on configuration problems.
The final project is your opportunity to integrate everything you have learned about HPC into a concrete, working example. Treat it as a miniature version of real‑world HPC work: design, implement, run, measure, understand, and clearly communicate your results.