Overview of the Final Project
The final part of this course is intentionally practical. Instead of just answering quiz questions, you will design, run, and analyze an HPC workload end‑to‑end, using the concepts you have learned.
This chapter describes:
- What the final project looks like
- Example project ideas
- Expected deliverables
- Evaluation criteria
- How to structure your work
- Suggested hands‑on exercises leading up to the project
The technical “how” (e.g., how to submit a SLURM job, how to use MPI, how to profile) is covered in earlier chapters. Here the focus is on turning those pieces into a coherent project and practice workflow.
Goals of the Final Project
By the end of the project you should be able to:
- Design an HPC computation that benefits from parallelism
- Choose appropriate resources (cores, nodes, memory, GPUs, wall clock time)
- Run and manage jobs on an HPC cluster using a batch system
- Collect performance data (runtime, scaling, resource usage)
- Interpret results and identify bottlenecks
- Document your setup so that someone else can reproduce your work
The project is intentionally open‑ended: you can work with provided examples or bring a small problem from your own domain, as long as it fits within the course constraints.
Project Constraints and Scope
To keep the project manageable for beginners, we impose some limits. A typical project should:
- Run comfortably within the available cluster allocation (e.g., up to a few dozen CPU cores and possibly a single GPU)
- Keep individual runs to roughly 10–30 minutes of wall‑clock time
- Use at least one form of parallelism (MPI, OpenMP, GPU, or a hybrid)
- Include some basic performance analysis (timings, scaling, or profiling)
- Be fully runnable by someone else with access to the same cluster or a similar system
You do not need to:
- Build a production‑grade, fully optimized code
- Achieve perfect scaling
- Use all programming models or all hardware types
Depth and clarity matter more than size or complexity.
Types of Acceptable Projects
You can choose one of several broad project types. Discuss with your instructor which option fits your background and time.
1. Parallelization of an Existing Serial Code
- Start from a provided serial reference implementation (e.g., a numerical kernel, simple simulation, or data processing script).
- Introduce parallelism using OpenMP, MPI, or a GPU approach.
- Compare performance and scaling of the serial and parallel versions.
Typical steps:
- Understand the algorithm and identify the most time‑consuming part.
- Select a parallelization strategy (e.g., loop parallelism, domain decomposition).
- Implement a minimal but correct parallel version.
- Run scaling experiments (vary thread count, processes, or nodes).
- Analyze speedup and bottlenecks.
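As a concrete illustration of these steps, here is a minimal sketch in Python: a serial reference sum over a numerical kernel, and a parallel version that decomposes the index range into chunks and distributes them with `multiprocessing`. The kernel `f` and the chunking scheme are illustrative choices, not part of any provided course code.

```python
import math
from multiprocessing import Pool

def f(x):
    """Example per-element work: a cheap numerical kernel."""
    return math.sqrt(x) * math.sin(x)

def serial_sum(n):
    """Serial reference implementation (keep this for comparison)."""
    return sum(f(i) for i in range(n))

def chunk_sum(bounds):
    """Work on one chunk [lo, hi) -- the unit of parallel work."""
    lo, hi = bounds
    return sum(f(i) for i in range(lo, hi))

def parallel_sum(n, workers=4):
    """Split the index range into one chunk per worker (a 1D domain decomposition)."""
    step = (n + workers - 1) // workers
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with Pool(workers) as pool:
        return sum(pool.map(chunk_sum, chunks))

if __name__ == "__main__":
    n = 100_000
    s, p = serial_sum(n), parallel_sum(n)
    # Validate the parallel version against the serial reference.
    print(abs(s - p) < 1e-6)
```

Timing both versions at different worker counts then gives the data for the scaling experiments in the later steps.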
2. Scaling Study of a Preexisting Parallel Application
- Use a preinstalled HPC application (e.g., a CFD, MD, or FEM package, or a machine learning framework).
- Do not modify its source code; instead focus on how it behaves on the cluster.
- Perform strong and/or weak scaling tests with a realistic problem size.
Typical steps:
- Learn how to run the application via job scripts.
- Choose a test case and define what you will measure.
- Run the same case with different core counts or nodes.
- Analyze how performance changes and why (I/O, communication, load imbalance, etc.).
3. End‑to‑End Workflow Project
- Focus on a full workflow rather than heavy coding.
- Combine: data preparation → main computation → postprocessing and visualization.
- Emphasize automation, job orchestration, and reproducibility.
Typical steps:
- Write scripts for each stage (e.g., data generation, simulation, analysis).
- Integrate them with the batch system (job dependencies, arrays).
- Log configuration, environment modules, and parameters.
- Measure throughput and resource utilization for the whole workflow.
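To make the orchestration step concrete, the sketch below constructs a chain of `sbatch` commands in which each stage waits for the previous one via SLURM's `--dependency=afterok:` flag. The script names are hypothetical, and in a real workflow you would run each command, parse the job ID from `sbatch`'s output, and substitute it for the `<prev_jobid>` placeholder.

```python
def dependency_chain(scripts):
    """Build sbatch commands where each stage waits for the previous one.

    The <prev_jobid> placeholder marks where the job ID parsed from the
    previous sbatch invocation would be inserted.
    """
    cmds = []
    for i, script in enumerate(scripts):
        if i == 0:
            cmds.append(f"sbatch {script}")
        else:
            cmds.append(f"sbatch --dependency=afterok:<prev_jobid> {script}")
    return cmds

# Hypothetical three-stage workflow: prepare -> simulate -> postprocess
for cmd in dependency_chain(["prepare.sh", "simulate.sh", "postprocess.sh"]):
    print(cmd)
```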
4. Mini Application from Your Domain
- Propose a small but complete computation relevant to your research or interests.
- Implementation can be in a language of your choice, provided it can run on the cluster.
- Must be constrained to fit the course timeline and resource limits.
Typical steps:
- Define a clear, limited problem (not your entire research project).
- Identify where parallelism is natural (e.g., independent tasks, data chunks).
- Implement a prototype and demonstrate at least one scaling experiment.
- Relate the results back to the characteristics of your problem.
Core Deliverables
Every project must produce four main artifacts:
- Code and job scripts
- Source code files
- Job submission scripts for the batch scheduler
- Any auxiliary scripts (data generation, plotting, postprocessing)
- Configuration and environment description
- Which modules or software stacks you used
- Compiler and key compilation flags
- Hardware and job parameters (nodes, tasks, threads, memory, GPUs)
- Performance and scaling results
- Raw measurements (execution times, iterations per second, throughput, etc.)
- At least one set of strong or weak scaling results
- A short interpretation of the results
- Written report
- See the next section for a suggested structure
You may also be asked for:
- A brief presentation (slides or live demo)
- A public or internal repository with your code and documentation
Suggested Report Structure
Keep the report concise but complete; 5–10 pages is typically sufficient if it is well organized.
1. Introduction
- Brief description of the problem you are addressing
- Why it is (or could be) relevant to HPC
- Which type of project you chose (parallelization, scaling study, etc.)
2. Methods and Implementation
- Programming language(s) and libraries used
- Parallelization approach (e.g., MPI domain decomposition, OpenMP loop parallelism, GPU kernels)
- High‑level description of the algorithm and data decomposition (do not repeat full textbooks)
- Key design choices and trade‑offs
Focus on what is specific to your implementation, not on re‑explaining basic parallel computing theory.
3. Experimental Setup
- Description of the HPC system used
- CPU type, number of cores per node
- Presence of GPUs or special accelerators (if relevant)
- Interconnect (only as needed to interpret results)
- Software environment
- Modules loaded or container image used
- Compiler, version, and major optimization flags
- Job parameters
- Node count, tasks per node, threads per task
- Problem sizes
- Wall‑time limits and any resource constraints
4. Results
Present numerical measurements rather than only qualitative statements.
Examples of what to include:
- Execution times for different core counts or nodes
- Speedup and efficiency:
$$
\text{Speedup}(p) = \frac{T_1}{T_p}, \quad
\text{Efficiency}(p) = \frac{\text{Speedup}(p)}{p}
$$
where $T_1$ is the runtime with one processing unit and $T_p$ the runtime with $p$ units.
- Throughput metrics appropriate to your problem (e.g., time steps per second, matrix solves per second, samples per second)
- Resource utilization highlights (e.g., memory usage, GPU occupancy if available from tools)
Use tables and basic plots (e.g., runtime vs. cores, speedup vs. cores) to make trends clear.
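The definitions above translate directly into code. The following sketch computes speedup and efficiency from a table of measured runtimes; the timing values are made up for illustration only.

```python
def speedup(t1, tp):
    """Speedup(p) = T1 / Tp."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Efficiency(p) = Speedup(p) / p."""
    return speedup(t1, tp) / p

# Hypothetical measured runtimes (seconds) for p = 1, 2, 4, 8 processing units
timings = {1: 120.0, 2: 63.0, 4: 34.0, 8: 21.0}
t1 = timings[1]
print(f"{'p':>3} {'T_p':>8} {'S(p)':>6} {'E(p)':>6}")
for p, tp in timings.items():
    print(f"{p:>3} {tp:>8.1f} {speedup(t1, tp):>6.2f} {efficiency(t1, tp, p):>6.2f}")
```

A table like this, together with the corresponding runtime and speedup plots, is usually all the Results section needs.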
5. Discussion
Interpret the results in light of what you learned in the course:
- Did your code scale as expected? Where does it start to saturate?
- Which factors limit performance (communication, I/O, memory bandwidth, load imbalance, algorithmic issues)?
- How do strong vs. weak scaling behaviors differ, if you tested both?
- How do different settings (e.g., threading vs. pure MPI, different problem sizes) affect performance?
You do not need extremely detailed performance modeling, but you should connect observations to plausible causes.
6. Limitations and Future Work
- What you did not implement due to time or resource constraints
- Obvious next steps (e.g., better load balancing, more advanced algorithms, GPU porting)
- How the project could be scaled up on larger systems or with more time
7. Reproducibility Notes
- Exact commands to build and run your code
- Any random seeds or configuration files
- Where input data and expected outputs are stored
- How another user could repeat at least one of your main experiments
Evaluation Criteria
While exact grading rubrics vary, projects are commonly evaluated on:
- Correctness and robustness
- The code runs and produces reasonable, consistent results
- The parallel version is logically correct (no obvious race conditions, deadlocks, or incorrect outputs for tested cases)
- Appropriate use of HPC resources
- Jobs request realistic resources (no massive over‑allocation for tiny tasks)
- The chosen form of parallelism is sensible for the problem
- Basic job management practices are followed (batch submission rather than heavy work on interactive or login nodes)
- Performance investigation
- You collected meaningful data (timings, scaling) rather than single anecdotal runs
- You attempted at least basic performance improvements
- You can explain performance trends in a reasoned way
- Quality of documentation
- Clear, organized report
- Enough detail to understand and reproduce the work
- Transparent about limitations and problems encountered
- Professional practices
- Clean structure of code and scripts
- Version control usage when possible
- Attention to reproducibility and environment management
Creativity and ambition are valued, but a smaller, well‑executed project is preferable to a grand plan that never fully works.
Recommended Workflow for the Project
To prevent last‑minute surprises, follow a staged approach.
Stage 1: Define the Problem and Plan
- Choose your project type and specific topic.
- Write a short proposal (1–2 pages):
- Aim of the project
- Planned methods and tools
- Expected outputs and performance tests
- Verify feasibility with your instructor or mentor.
Stage 2: Get a Correct Baseline
- Implement or obtain a serial or reference version.
- Validate correctness on small inputs:
- Check against known results or invariants.
- Add simple tests you can run quickly.
- Create the initial job script and verify the code runs on the cluster.
Focus on correctness before performance.
Stage 3: Introduce Parallelism or Scaling
- Add the chosen parallel mechanism(s) to your code, or set up scaling runs for existing software.
- Start with very small configurations (few cores, small data) to debug parallel logic.
- Once it works, gradually increase resource counts and problem size.
Make sure to keep the serial or single‑process version intact for comparison.
Stage 4: Systematic Experiments
Plan a small but structured experiment matrix, for example:
- Vary number of cores: $1, 2, 4, 8, 16$ (or as available)
- Optionally vary problem size (for weak scaling)
- For hybrid or GPU codes, vary relevant parameters (threads per task, number of GPUs)
For each configuration:
- Record all parameters (nodes, tasks, threads, input size, environment modules)
- Run multiple trials if runtime is highly variable and average the results
- Save logs and job output files systematically
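One lightweight way to organize such a matrix is to enumerate every configuration up front and write a CSV template that each job fills in afterwards. This is a sketch under assumptions (the file name `experiments.csv` and the column layout are arbitrary choices):

```python
import csv
import itertools

core_counts = [1, 2, 4, 8, 16]
problem_sizes = ["small", "large"]

# Cartesian product: one row per configuration to run and record.
matrix = list(itertools.product(core_counts, problem_sizes))

with open("experiments.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["cores", "problem_size", "runtime_s", "notes"])
    for cores, size in matrix:
        writer.writerow([cores, size, "", ""])  # runtime filled in after each job

print(len(matrix))  # 10 configurations
```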
Stage 5: Analyze and Visualize
- Compute speedups, efficiencies, or other metrics.
- Create basic plots:
- Runtime vs. cores
- Speedup vs. cores
- Efficiency vs. cores
- Identify where scaling degrades and hypothesize reasons.
Use profiling or logging information as needed to support your explanations.
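Locating where scaling degrades can itself be automated. The sketch below scans a runtime table for the first core count whose parallel efficiency drops below a chosen threshold; the 0.5 cutoff and the timing values are illustrative assumptions, not a course requirement.

```python
def saturation_point(timings, threshold=0.5):
    """Return the smallest core count whose parallel efficiency falls
    below `threshold`, or None if scaling holds throughout.

    `timings` maps core count p -> measured runtime T_p and must
    include p = 1 as the baseline.
    """
    t1 = timings[1]
    for p in sorted(timings):
        eff = (t1 / timings[p]) / p  # Efficiency(p) = Speedup(p) / p
        if eff < threshold:
            return p
    return None

# Hypothetical strong-scaling timings: reasonable up to 8 cores, poor at 16.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0, 16: 14.0}
print(saturation_point(timings))  # -> 16
```

Once you know where efficiency collapses, profiling that specific configuration is usually the fastest route to an explanation.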
Stage 6: Finalize Report and Packaging
- Clean up and comment your code.
- Make sure job scripts are generic and not tied to temporary directories or personal paths.
- Verify that your repository or project directory includes:
- Source code
- Job scripts
- Instructions (README)
- Key configuration files
- Plots and data used for the report
Write and proofread the report, checking that all figures are readable and labeled, and that another reader could follow your workflow.
Progressive Hands‑On Exercises
Before or alongside the final project, you are encouraged (or may be required) to complete smaller hands‑on exercises. These are designed to practice individual skills you will need for the project.
Below is a suggested progression; details such as exact commands or little code snippets are covered in earlier chapters.
Exercise 1: Basic Job Submission
- Write a minimal batch script that:
- Requests a small number of cores for a short time
- Runs a simple CPU‑bound program (e.g., computing $\pi$ by numerical integration)
- Submit the job, check its status, and inspect the output files.
- Record what happens if you request too many resources or too little wall time.
Goal: Become comfortable with the batch system and job lifecycle.
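As a concrete candidate for the CPU‑bound program, here is one standard way to compute $\pi$ by numerical integration: midpoint‑rule integration of $4/(1+x^2)$ on $[0,1]$, whose exact value is $\pi$. This is a minimal serial sketch; your batch script would simply invoke it.

```python
def estimate_pi(n):
    """Midpoint-rule integration of 4/(1+x^2) on [0, 1], which equals pi."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h          # midpoint of subinterval i
        total += 4.0 / (1.0 + x * x)
    return total * h

print(estimate_pi(1_000_000))  # close to 3.141592653589793
```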
Exercise 2: Thread‑Level Parallelism
- Take a simple loop‑based computation.
- Parallelize it using OpenMP (or a similar threading API).
- Run with different numbers of threads.
- Measure execution times and produce a small speedup table.
Goal: Practice thread control, environment variables, and basic performance measurement.
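If your course uses C with OpenMP for this exercise, earlier chapters cover the mechanics. For illustration only, the sketch below shows the measurement side in Python; since CPython threads do not speed up CPU‑bound loops, it uses worker processes as a stand‑in for threads. The work function and task counts are arbitrary choices.

```python
import time
from concurrent.futures import ProcessPoolExecutor

def busy(n):
    """CPU-bound work unit."""
    s = 0.0
    for i in range(n):
        s += (i % 7) * 0.5
    return s

def timed_run(workers, tasks=8, n=200_000):
    """Run `tasks` independent work units on `workers` processes; return seconds."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as ex:
        list(ex.map(busy, [n] * tasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    t1 = timed_run(1)
    for w in (1, 2, 4):
        tw = timed_run(w)
        print(f"workers={w}  time={tw:.3f}s  speedup={t1 / tw:.2f}")
```

The printed lines form exactly the kind of small speedup table the exercise asks for.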
Exercise 3: Process‑Level Parallelism
- Use a small MPI‑based example program.
- Run it with different numbers of processes and, optionally, across multiple nodes.
- Observe how execution time changes.
- Experiment with different process layouts if the batch system allows it.
Goal: Understand multi‑process execution and mapping onto cluster resources.
Exercise 4: Simple Scaling Study
- Select a small parallel application (provided or from an earlier exercise).
- Conduct a mini strong scaling test:
- Fix the problem size.
- Double the resources stepwise (e.g., 1, 2, 4, 8 processes).
- Plot runtime or speedup as a function of process count.
Goal: Bridge between “it runs” and “how well does it scale,” in preparation for the project.

Exercise 5: Workflow and Reproducibility
- Chain two or three steps:
- Data generation
- Main computation
- Postprocessing (e.g., statistics or plots)
- Automate the workflow with job dependencies or a simple script.
- Record the exact environment (modules, versions) and store configuration in a text file.
Goal: Practice constructing small but repeatable HPC workflows.
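Recording the environment can be scripted rather than done by hand. The sketch below snapshots a few reproducibility‑relevant facts to a JSON file; the file name is an arbitrary choice, and `LOADEDMODULES` is the variable typically set by the `module` system, which may be absent on your machine.

```python
import json
import os
import platform
import sys

def environment_record():
    """Collect a minimal, reproducibility-oriented snapshot of the environment."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "hostname": platform.node(),
        # Set by Environment Modules / Lmod when modules are loaded, if present.
        "loaded_modules": os.environ.get("LOADEDMODULES", ""),
    }

with open("environment.json", "w") as fh:
    json.dump(environment_record(), fh, indent=2)
print(sorted(environment_record().keys()))
```

Committing a file like this alongside your job scripts makes the "exact environment" requirement of the report nearly free.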
Collaboration and Academic Integrity
Collaboration policies vary by course; follow your instructor’s rules. Common guidelines include:
- You may discuss general ideas and debugging strategies with classmates.
- Each student (or approved project pair/group) must:
- Write their own code (except for provided templates) and job scripts.
- Produce their own report and analysis.
- If you reuse any external code, libraries, or scripts:
- Cite them clearly in your report.
- Distinguish between your own contributions and third‑party components.
Transparency is crucial: clearly stating what is yours, what is adapted, and what is used as‑is is part of professional HPC practice.
Practical Tips for a Successful Project
- Start early; cluster queues and debugging can introduce delays.
- Keep runs small and fast while prototyping; scale up only when things are stable.
- Save intermediate results and logs systematically; you will need them for the report.
- Expect some failures (e.g., jobs killed by limits, numerical crashes) and budget time to investigate them.
- Ask for help when you are stuck on infrastructure or environment issues; do not lose days on configuration problems.
The final project is your opportunity to integrate everything you have learned about HPC into a concrete, working example. Treat it as a miniature version of real‑world HPC work: design, implement, run, measure, understand, and clearly communicate your results.