Purpose of the Final Project
The final project in this course is your opportunity to integrate everything you have learned into a coherent, realistic high-performance computing workflow. Instead of working through isolated examples, you will design, execute, and analyze an end-to-end computation on an HPC system. The project is not about writing the most advanced code. It is about demonstrating that you can think and act like an HPC user: choosing appropriate tools, sizing and submitting jobs, handling data, measuring performance, and documenting your work so that others can reproduce it.
The hands-on exercises that accompany the project are designed to de-risk each major step. By the time you assemble the full project, you will already have practiced connecting to a cluster, compiling and running parallel codes, using a job scheduler, and performing basic performance analysis.
Learning Goals
The project and exercises focus on practical competence rather than theoretical depth. After completing them, you should be able to log in to a cluster, build and run a nontrivial application, scale it beyond a laptop, and support claims about its performance and resource usage with measured data.
You should not expect to become an expert in any one library or framework within this single project. Instead, you should be able to show that you can make reasonable design choices. For example, you will decide whether to use shared memory, distributed memory, or a hybrid approach, and whether you need GPUs. You will then justify these choices using timing results, scaling tests, and resource usage reports.
The final deliverables will emphasize clarity and reproducibility. A well-documented, modestly scaled project is more valuable than an ambitious but poorly supported one.
Components of the Final Project
The project is composed of four tightly connected parts that mirror a typical HPC workflow: problem design, implementation and execution, performance analysis, and documentation. Each part is evaluated both on technical correctness and on how clearly you communicate what you did and why.
In the design phase, you select an application domain or dataset and define a computational task with a meaningful workload. In the implementation and execution phase, you put this design into practice on an HPC system, using the appropriate programming model and job scheduler. In the performance phase, you carry out scaling or parameter studies and interpret the results in the context of the course concepts. In the documentation phase, you create a concise report and a set of artifacts that allow another user to rerun your work.
Throughout, you are expected to apply good resource stewardship, including thoughtful job sizing and minimal waste of CPU or GPU hours.
Designing an HPC Application
Design for the final project begins with a clear statement of the computational problem. You should define an input, an output, and a well-specified process that transforms one into the other. The task must be large or demanding enough that running it on an HPC cluster is meaningful. A trivial script that finishes in seconds on a laptop is not sufficient. At the same time, it should remain manageable within the cluster quotas and wall time limits available to you.
You may either develop a simple parallel application yourself or adopt an existing HPC-capable application or library. In either case, the design must describe what is being computed, what approximate problem size you will target, and which performance questions you intend to answer. For example, you might plan to study how runtime scales with the number of MPI processes for a fixed dataset, or how memory usage behaves for increasing grid resolutions.
Your design must also specify the primary programming model. You might use OpenMP, MPI, a hybrid MPI plus OpenMP configuration, or a GPU-oriented model such as CUDA or OpenACC. The choice should follow naturally from the type of computation and the expected bottlenecks. You should not attempt to use all available technologies at once. Instead, choose one or two that you can use correctly.
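To make the distributed-memory option concrete, the sketch below shows the typical shape of a minimal MPI computation in C: each rank processes its share of a global range and the partial results are combined with a reduction. This is illustrative only; the problem size `N` and the per-element work are placeholders for your own computation.

```c
/* Minimal MPI sketch: each rank sums part of a global range.
 * Illustrative only; N and the per-element work are placeholders. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 100000000L;            /* placeholder problem size */
    long chunk = N / size;
    long lo = rank * chunk;
    long hi = (rank == size - 1) ? N : lo + chunk;  /* last rank takes the remainder */

    double local = 0.0;
    for (long i = lo; i < hi; ++i)
        local += 1.0 / (double)(i + 1);   /* stand-in for real work */

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("result = %f on %d ranks\n", global, size);

    MPI_Finalize();
    return 0;
}
```

Built with the cluster's MPI compiler wrapper (typically mpicc) and launched through the scheduler, this pattern already supports the kind of scaling study described later in this chapter.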
The application design must include:
- A clearly defined computational problem.
- A justification for using an HPC system.
- A specified programming model (for example MPI, OpenMP, or a GPU model such as CUDA).
- A plan for at least one scaling or performance study.
When you design your application, think about data management from the beginning. Specify where input data will be stored, how intermediate results will be handled to avoid excessive I/O, and what outputs are essential. Avoid producing huge unnecessary output files that provide little insight into performance or correctness.
Scoping and Resource Planning
Proper scoping is critical for a successful final project. You need a problem that is large enough to benefit from parallel resources but small enough that you can iterate quickly and respect the cluster limits. To find this balance, you should first identify a minimal useful problem size that runs comfortably on a single node. Then, estimate how that workload scales as you increase resolution, number of timesteps, or number of samples.
A common approach is to select a baseline configuration that runs in a few minutes using modest resources. From that baseline, you can explore scaling to more cores, more nodes, or larger inputs without risking multi-hour failures. This approach allows you to collect enough data points for strong or weak scaling plots while keeping cluster usage reasonable.
Resource planning should also consider memory, storage, and queue policies. You must ensure that your per task memory usage fits within the node configuration and that your total output can be stored in your quota. If the cluster uses separate partitions or constraints for CPU and GPU jobs, you must check that your requested resources correspond to a queue that you can access within the project timeline.
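To make such an estimate concrete, consider a hypothetical application that stores two double precision fields on an $N^3$ grid. Its memory footprint is approximately

$$
M \approx 2 \times N^3 \times 8 \ \text{bytes},
$$

so a $1024^3$ grid occupies about 16 GiB in total, or roughly 2 GiB per process when distributed evenly over 8 processes. Comparing numbers like these against the memory available per node tells you quickly whether a planned resolution fits.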
Before running large jobs, verify:
- Baseline runtime is a few minutes, not hours.
- Estimated memory per process or thread fits within node limits.
- Queues and partitions used are accessible and appropriate.
- Output size will not exceed your storage quota.
As you plan, keep in mind that some jobs will fail. You should reserve time and resources for debugging and reruns. It is better to plan for repeated short experiments than to rely on a single long run that might be cancelled or fail due to configuration errors.
Running Large-Scale Simulations
Once your application is designed and tested on small inputs, you will prepare and execute larger runs on the cluster. This involves creating job scripts, specifying task counts, wall time, memory, and any partition or account options required by the local scheduler. Although the course has a dedicated chapter on job scheduling, here the focus is on using these tools coherently in a project setting.
You will move from interactive experimentation to fully batch-driven workflows. Small tests can be run interactively on a login or development node if the policy allows it, but your main data and scaling runs must be submitted through the scheduler. Each run should be traceable to a particular configuration so that you can interpret its output correctly.
For the final project, you are expected to conduct at least one structured experiment that varies a single parameter systematically. For example, you may measure runtime for a fixed problem size as you increase the number of CPU cores, or you may measure runtime for increasing problem sizes while keeping total core count fixed. In both cases, the aim is to observe how your code behaves under more demanding workloads or with more parallel resources.
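One lightweight way to prototype such an experiment before committing to separate batch jobs is an in-process thread sweep with OpenMP, sketched below. The kernel is a placeholder, and in the real study each configuration would normally be a separate scheduled job so that measurements reflect realistic placement.

```c
/* Sketch of a strong-scaling measurement: fixed problem size,
 * increasing thread counts. The kernel is a placeholder. */
#include <omp.h>
#include <stdio.h>

#define N 50000000L

static double kernel(long n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += 1.0 / (double)(i + 1);    /* stand-in for real work */
    return sum;
}

int main(void) {
    for (int threads = 1; threads <= 8; threads *= 2) {
        omp_set_num_threads(threads);
        double t0 = omp_get_wtime();
        double r = kernel(N);
        double t1 = omp_get_wtime();
        printf("threads=%d time=%.3fs result=%.6f\n",
               threads, t1 - t0, r);
    }
    return 0;
}
```

The printed runtimes feed directly into the speedup and efficiency calculations discussed in the performance analysis section.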
You must keep careful records of each job: the script used, the parameters passed to the application, the scheduler output, and any logs produced by your code. You will use this information later when you reconstruct your performance analysis and when you document how to reproduce your experiments.
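A small habit that makes this record keeping easier is to have the application itself print its configuration at startup, so every job log is self-describing. A minimal sketch:

```c
/* Print run metadata at startup so each job log is self-describing. */
#include <stdio.h>
#include <time.h>

static void log_run_metadata(int argc, char **argv) {
    time_t now = time(NULL);
    printf("# started : %s", ctime(&now));   /* ctime appends '\n' */
#ifdef __VERSION__
    printf("# compiler: %s\n", __VERSION__); /* GCC/Clang-specific macro */
#endif
    printf("# built   : %s %s\n", __DATE__, __TIME__);
    printf("# argv    :");
    for (int i = 0; i < argc; ++i)
        printf(" %s", argv[i]);
    printf("\n");
}

int main(int argc, char **argv) {
    log_run_metadata(argc, argv);
    /* ... application ... */
    return 0;
}
```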
Integrating Hands-On Exercises
Before attempting your full project runs, you will complete a series of smaller exercises that mirror each project stage in isolation. These exercises are not separate from the project; they are stepping stones that help you verify that each component of your workflow is functional and that you understand how to use the cluster responsibly.
One exercise will focus on compiling and running a simple parallel code in your language of choice, including correct use of the compiler wrappers and relevant optimization flags. Another exercise will require you to write and submit a basic batch script, confirm that the job runs successfully, and interpret the scheduler output regarding wall time and resource usage. A further exercise will introduce you to performance measurement tools or simple timing mechanisms so that you can obtain reliable runtime data.
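For codes that use neither OpenMP nor MPI timers, a portable monotonic clock is sufficient for this exercise. A minimal sketch, assuming a POSIX system:

```c
/* Minimal wall-clock timing with a monotonic clock (POSIX). */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double t0 = wall_seconds();
    /* ... region to measure; placeholder work below ... */
    volatile double x = 0.0;
    for (long i = 0; i < 10000000L; ++i) x += 1.0;
    double t1 = wall_seconds();
    printf("elapsed: %.6f s\n", t1 - t0);
    return 0;
}
```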
You are encouraged to adapt the templates and techniques from these exercises directly into your project. For instance, your final job scripts can be refinements of the ones used in the exercises, and your timing methodology can reuse the same timing functions or profiling tools. The exercises provide a safe environment to experiment with commands and options before applying them to your more expensive project runs.
Performance Analysis and Optimization Report
A core deliverable of the final project is a performance analysis and optimization report. The goal is not to demonstrate perfect scaling, but to show that you can measure performance quantitatively, interpret the results, and make reasoned improvements.
You will collect runtime data from multiple runs, usually parameterized by core count or problem size. From these measurements, you will compute speedup, efficiency, or other relevant metrics. For example, parallel speedup for $p$ processing elements relative to a single processing element is defined as
$$
S(p) = \frac{T(1)}{T(p)},
$$
where $T(1)$ is the runtime on one processing element and $T(p)$ is the runtime on $p$ processing elements. Parallel efficiency is then
$$
E(p) = \frac{S(p)}{p}.
$$
These quantities help you assess whether adding more resources actually produces substantial gains.
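As a worked example with hypothetical timings: if a baseline run takes $T(1) = 120$ seconds and the same problem takes $T(8) = 20$ seconds on eight processing elements, then

$$
S(8) = \frac{120}{20} = 6, \qquad E(8) = \frac{S(8)}{8} = 0.75,
$$

so the run is six times faster, but each processing element delivers only 75 percent of its serial-baseline throughput. Whether that tradeoff is acceptable depends on how scarce the resources are.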
In your report you must:
- Present measured runtimes for multiple configurations.
- Compute and interpret speedup $S(p)$ and efficiency $E(p)$ where applicable.
- Identify at least one bottleneck or limitation.
- Attempt at least one targeted optimization and evaluate its impact.
Optimization at this stage is not about deep algorithmic changes. Instead, you might adjust compiler flags, change the number of threads per process, alter the mapping of tasks to nodes, improve I/O patterns, or refine problem size. Even such modest adjustments can reveal important performance tradeoffs and reinforce concepts like load balancing, communication overhead, and memory bandwidth limits.
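To make one of these adjustments concrete, the sketch below shows a common I/O improvement: instead of issuing one small write per iteration, per-step results are accumulated in memory and written once at the end. The record layout and file name are placeholders.

```c
/* I/O pattern sketch: buffer per-step results and write them once,
 * instead of issuing one small write per iteration. */
#include <stdio.h>
#include <stdlib.h>

#define STEPS 100000

int main(void) {
    double *buf = malloc(STEPS * sizeof *buf);
    if (!buf) return 1;

    for (int i = 0; i < STEPS; ++i)
        buf[i] = (double)i * 0.5;        /* placeholder per-step result */

    /* One large write at the end instead of STEPS small ones. */
    FILE *f = fopen("results.bin", "wb");
    if (!f) { free(buf); return 1; }
    fwrite(buf, sizeof *buf, STEPS, f);
    fclose(f);
    free(buf);
    return 0;
}
```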
Your report should include concise plots or tables, but the emphasis must be on the narrative. Explain what you expected to see, what you actually observed, and what conclusions you can draw. If results are surprising or imperfect, that is acceptable as long as you analyze them thoughtfully.
Ensuring Reproducibility
Reproducibility is a central theme of the final project. Another user with access to the same cluster should be able to follow your instructions and obtain equivalent results, subject to normal variability. Achieving this goal requires careful attention to your software environment, configuration management, and documentation.
You must specify the modules loaded, compiler versions, and any external libraries or containers used. If you rely on configuration files or custom scripts, they must be included in your submission and referenced clearly. Paths in your scripts should be relative or parameterized where possible so that another user can adapt them to their own home directory without confusion.
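One way to keep paths out of the source entirely is to read them from the environment with a sensible relative fallback, as in this hypothetical sketch (the variable name PROJECT_DATA is an assumption for illustration):

```c
/* Read the data directory from the environment instead of hardcoding it.
 * PROJECT_DATA is a hypothetical variable name for illustration. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *datadir = getenv("PROJECT_DATA");
    if (!datadir)
        datadir = "./data";              /* fallback relative path */

    char path[4096];
    snprintf(path, sizeof path, "%s/input.dat", datadir);
    printf("reading input from %s\n", path);
    /* ... open and process the file ... */
    return 0;
}
```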
Input datasets should either be small enough to include with your project materials or clearly identified with instructions on how to obtain them. If you use a shared dataset on the cluster, document its location and any preprocessing steps. For very large datasets, you should provide a reduced sample that allows reviewers to verify the workflow even if they cannot reproduce the full-scale runs.
Documentation and Best Practices Summary
Your final submission must include a concise but complete set of documentation that ties together all aspects of your project. This documentation serves both as a guide for reproduction and as a reflection on what you learned. It should not be a raw dump of notes but a structured narrative with clear sections that mirror the project lifecycle.
Typically, you will provide an overview of the problem, a description of the implementation and programming model, a summary of the computational environment, a presentation of your experiments and performance results, and a short discussion of limitations and possible future improvements. Where appropriate, include code snippets or command examples, but separate them from your main text to keep the narrative readable.
Many HPC projects benefit from a small set of supporting files in the project directory, such as a README that explains how to build and run the code, a configuration file that captures common job parameters, and example job scripts. For this course, you are expected to provide at least a minimal set of such artifacts.
Your final project package must contain:
- Source code or configuration files used for the computation.
- At least one example job script that runs a representative case.
- A README or equivalent document with build and run instructions.
- A written report describing the problem, methods, experiments, and results.
As you prepare these materials, remember that clarity is more valuable than volume. A concise project that can be understood and rerun in a reasonable time is more successful than an oversized, opaque one. The ultimate measure of success is whether someone else can pick up your work, reproduce your results, and see clearly how you applied HPC concepts in practice.