Performance analysis and optimization report

Purpose of the Performance Analysis and Optimization Report

The performance analysis and optimization report is the final step of your project. It shows that you can not only run a program on an HPC system, but also measure how well it runs, interpret the results, and make justified improvements. In the report, you move from raw timings and tool output to clear conclusions and concrete code or configuration changes.

The audience for this report is a technically literate reader who has not followed your work day by day. They should be able to understand what you did, why you did it, and what you learned, without reading your source code line by line.

Overall Structure of the Report

A clear, consistent structure is more important than writing style. A common and effective structure for an HPC performance report is:

  1. Problem and setup overview
  2. Baseline performance characterization
  3. Bottleneck analysis
  4. Optimization plan and changes
  5. Post-optimization results and comparison
  6. Discussion and lessons learned
  7. Reproducibility checklist

You can adapt section names, but you should include all of these elements.

Your report must always connect measurements → identified bottleneck → specific optimization → new measurements. Numbers without interpretation, or optimizations without evidence, are not sufficient.

Problem and Setup Overview

Begin with a short description of the application and what “good performance” means for it. This is not a full scientific introduction, only enough context for the performance discussion.

Describe:

The application goal. For example, “3D heat diffusion solver using finite differences” or “Molecular dynamics mini-application”.

The main parallel model. For example, “MPI-only,” “MPI + OpenMP,” or “CUDA + MPI.” Details of MPI or OpenMP syntax belong in earlier chapters, so here you only state what is used.

The primary performance metric. Common choices are total time to solution, simulated time per real second, throughput (tasks per second), or cost per iteration or time step.

Then specify the environment in which you collected results. This should include:

Cluster name or platform identifier.

Compiler and key options, such as -O3, or -g if you needed debug symbols for profiling.

Important library or tool versions if relevant to performance, for example a specific MPI or math library.

Any important hardware characteristics that strongly affect your results, such as number of cores per node or presence of GPUs. You do not need a full hardware description if the course already provided one.

The goal is that another person who has access to the same system could, in principle, reproduce your results.

Baseline Performance Characterization

The next step is to establish how the unoptimized or minimally optimized version behaves. This is the reference against which you will compare everything else.

Describe how you obtained the baseline:

Which code version or git commit you used.

The input size or problem parameters. For example, grid size, number of particles, or length of the simulation run.

The resources used. For example, “1 node, 32 MPI processes” or “4 nodes, 4 MPI processes per node, 8 OpenMP threads per process.”

Run the code multiple times and report a stable statistic, such as the median or average runtime. For short runs, variation can be significant, so mention how many runs you averaged.

You should report at least:

Wall clock time of the whole run, $T_{\text{total}}$.

If applicable, a breakdown into major phases, such as initialization, main loop, I/O, and finalization.

If your application has a clearly defined “unit of work,” such as time steps or iterations, compute:

$$\text{Time per unit} = \frac{T_{\text{total}}}{\text{Number of units}}.$$
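
For example, with made-up numbers, a run with $T_{\text{total}} = 120$ s over 600 time steps corresponds to 0.2 s per step.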

Include a small, well-labeled table or figure with baseline results, but keep it simple. The aim is to make the following sections understandable, not to impress with formatting.

Bottleneck Analysis Methodology

Once you have a baseline, your report must show how you identified where time is actually spent and what limits further improvement. This is the “analysis” part, and it should use data rather than guesswork.

You are expected to use tools introduced earlier in the course, such as timers, profilers, and system-level metrics. In the report, do not explain how each tool works in general. Instead, be explicit about:

Which tools you used, for example time, perf, gprof, nvprof, or vendor-specific profilers.

Which key metrics from each tool you relied on.

How those metrics led you to specific conclusions.

Common metrics include:

Function or region level breakdowns. For example, “70 percent of the runtime is in function compute_flux.”

Fraction of time spent in MPI or synchronization calls. For example, “25 percent of time in MPI_Allreduce.”

Cache miss rates, memory bandwidth, or instructions per cycle (IPC), if available.

GPU kernel execution times and occupancy for accelerator-based code.

Communication volume and time between processes.

Explain briefly how you measured these, for example: “We added MPI_Wtime based timers around the main solver loop” or “We ran the code under profiler X and collected the default summary report.”
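
As an illustration, a minimal timer around the main loop might look like the sketch below; solver_step, run_main_loop, and nsteps are placeholder names, and the MPI_Reduce reports the slowest rank, which is what determines wall clock time.

#include <mpi.h>
#include <stdio.h>

void solver_step(void);   /* placeholder for the application's own routine */

void run_main_loop(int nsteps)
{
    double t_start = MPI_Wtime();
    for (int step = 0; step < nsteps; step++)
        solver_step();
    double t_loop = MPI_Wtime() - t_start;

    /* The slowest rank determines the observed wall clock time. */
    double t_max;
    int rank;
    MPI_Reduce(&t_loop, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("main loop: %.3f s total, %.4f s per step\n",
               t_max, t_max / nsteps);
}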

Your analysis should move from raw measurements to specific bottlenecks. Typical bottleneck types include:

Computation limited. A few functions dominate the runtime, and hardware performance counters suggest that arithmetic throughput is the limiting factor.

Memory or cache limited. High cache miss rates or high memory bandwidth usage, with low arithmetic intensity.

Communication limited. A large share of time spent waiting in MPI calls or on collective operations.

Load imbalance. Some processes or threads finish much later than others. Timeline or rank-level statistics reveal this imbalance.

I/O limited. Significant time in reading or writing data compared to computation.

In the report, name the dominant bottleneck or bottlenecks. Support your claim with one or two key numbers from your measurements, not with full tool dumps.

A performance bottleneck must be backed by quantitative evidence. Statements such as “the code is probably memory bound” without data are not acceptable in a performance report.

Designing an Optimization Plan

After identifying bottlenecks, you should propose a focused optimization plan. The report does not need to list every possible idea you considered, but it should show that your optimizations are driven by your analysis.

For each bottleneck you choose to address, state:

The bottleneck in precise terms, for example “80 percent of runtime in compute_rhs, which is memory bound.”

Your hypothesis about why this occurs, such as “The loop in compute_rhs strides through memory with poor locality.”

The planned optimization, which might target algorithms, data layout, parallelization strategy, or low-level tuning.

You must keep the scope reasonable. For a course project it is better to address one or two major bottlenecks well than to make many small, undocumented tweaks. Make clear which aspects you will not attempt to optimize and why, for example, limited time or risk to correctness.

At this stage you should also define what “success” means numerically. For example:

Target reduction in total runtime, such as “aiming for 2× speedup on 2 nodes.”

Better parallel efficiency when moving from 1 node to 4 nodes, expressed as a percentage.

Reduced time in a specific function or phase, such as “cut time in I/O from 30 percent to below 10 percent.”

Having explicit targets makes your later evaluation more meaningful.

Describing Implemented Optimizations

This is where you document what you actually changed, how, and why. You do not need to reproduce full source files in the report, but you should describe each significant optimization at an appropriate level of detail.

For each optimization, you should include:

A short, descriptive title, such as “Reordering loops in compute_flux for better locality” or “Reducing collective communication in time stepping loop.”

The rationale, which connects directly back to the identified bottleneck.

A concise description of the code or configuration change. You can show the essence of a change with a small code excerpt inside a code block.

For example, to illustrate loop reordering:

/* Before: the innermost loop varies i, the index with the largest stride */
for (k = 0; k < nz; k++) {
  for (j = 0; j < ny; j++) {
    for (i = 0; i < nx; i++) {
      a[i][j][k] = ...
    }
  }
}
/* After: iterate in memory-contiguous order */
for (i = 0; i < nx; i++) {
  for (j = 0; j < ny; j++) {
    for (k = 0; k < nz; k++) {
      a[i][j][k] = ...
    }
  }
}
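
If one of your optimizations reduces collective communication, the same kind of excerpt is enough. The following is a hedged sketch, assuming the algorithm tolerates a convergence check only every few steps; compute_step, check_interval, and tol are hypothetical names.

#include <mpi.h>

double compute_step(void);   /* placeholder: advances one step, returns the local residual */

void time_loop(int nsteps, int check_interval, double tol)
{
    for (int step = 0; step < nsteps; step++) {
        double local_res = compute_step();
        /* Before, MPI_Allreduce was called every step; now it runs only
           every check_interval steps. */
        if (step % check_interval == 0) {
            double global_res;
            MPI_Allreduce(&local_res, &global_res, 1,
                          MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
            if (global_res < tol)
                break;   /* all ranks see the same value and exit together */
        }
    }
}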

Any changes to parallelization layout, such as changing the number of MPI processes per node or threads per process, should also be described, together with the motivation. For example, you might reduce MPI process count to decrease communication or increase threads per process to improve cache sharing.

If you use compiler flags or library calls that affect performance, list the key ones once, such as use of -O3, -march=native, or linking against a specific BLAS library.

The report must also explicitly state how you preserved correctness, for example by:

Running the same input with the new code and comparing results to the baseline.

Checking that differences are within acceptable numerical tolerance for floating point algorithms.

Even when you do not show all test cases, mention your verification strategy.
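
For instance, a small verification helper can compare the optimized output against a stored baseline within a relative tolerance; the function and argument names here are placeholders.

#include <math.h>
#include <stdio.h>

/* Returns 1 if every element of optimized matches baseline within rel_tol. */
int results_match(const double *baseline, const double *optimized,
                  int n, double rel_tol)
{
    for (int i = 0; i < n; i++) {
        double scale = fabs(baseline[i]) > 1.0 ? fabs(baseline[i]) : 1.0;
        if (fabs(baseline[i] - optimized[i]) > rel_tol * scale) {
            printf("mismatch at index %d: %g vs %g\n",
                   i, baseline[i], optimized[i]);
            return 0;
        }
    }
    return 1;
}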

Measuring and Presenting Post-Optimization Results

After the changes are implemented, repeat your measurements with the same care as for the baseline. Your report must present pre- and post-optimization results in a form that makes comparison obvious.

Measure:

Total wall clock time and, if appropriate, time per iteration or time step.

The same breakdown by functions or phases that you used in the baseline.

Where relevant, scaling behavior when changing the number of cores, processes, or nodes.

From these measurements, compute speedups:

$$S = \frac{T_{\text{baseline}}}{T_{\text{optimized}}}.$$

For example, if the baseline runtime was 100 seconds and the new runtime is 50 seconds, then $S = 2$.

If you changed parallel scaling behavior, report parallel efficiency when running on $P$ processing elements:

$$E(P) = \frac{T(1)}{P \cdot T(P)}.$$

Here $T(1)$ is the runtime on a single processing element and $T(P)$ is the runtime on $P$ processing elements. If you cannot measure $T(1)$ directly because the problem does not fit, explain what you used instead as a reference.
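
As a worked example with made-up numbers: if $T(1) = 400$ s and $T(8) = 60$ s, then

$$E(8) = \frac{400}{8 \cdot 60} \approx 0.83,$$

that is, about 83 percent parallel efficiency on 8 processing elements.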

Where possible, use the same inputs and resources as in the baseline. If you must change them, justify the change clearly.

Use simple tables or plots with clear labels. For text-based reports, a compact table is usually sufficient, with entries for:

Baseline time.

Optimized time.

Speedup.

Parallel efficiency (if tested).

Always report both absolute times and relative improvements. A large percentage speedup from a very small baseline time can be less meaningful than a modest improvement in a dominant phase.

Interpreting Results and Remaining Bottlenecks

In your discussion of results, move beyond “it is faster” and address why the observed changes make sense.

Explain, for each major optimization:

Whether it achieved the intended effect on the targeted metric.

How much it contributed to overall speedup.

Whether it introduced any trade-offs, such as increased memory usage or more complex code.

Sometimes an optimization improves one metric but worsens another. For example, you might decrease computation time but increase communication, or vice versa. A good report makes these trade-offs explicit.

You should also revisit bottlenecks after optimization. Use profiling or timing again to see where the code now spends its time. Often, a previously small cost becomes dominant after optimizing the old hotspot.

Discuss any remaining limitations, such as:

Residual load imbalance that you did not have time to fix.

Poor scaling beyond a small number of nodes.

Algorithmic constraints that prevent further improvement.

This section does not need to propose new work in detail, but it should show awareness that optimization is an iterative process.

Connecting to Scalability and Efficiency Concepts

Your final report should connect your experience with the scaling concepts covered earlier in the course. You do not need to re-derive formal laws, but you should relate your observations to them.

For example, if you measure strong scaling, note how the speedup curve deviates from ideal linear scaling. Discuss reasons such as communication overhead, serial fractions of the code, or increased contention for shared resources.
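
If you want a simple reference model for this discussion, Amdahl's law gives the ideal strong-scaling speedup on $P$ processing elements when a fraction $f$ of the work is parallelizable:

$$S(P) = \frac{1}{(1 - f) + f/P}.$$

The serial fraction $1 - f$ bounds the achievable speedup at $1/(1 - f)$, no matter how large $P$ becomes.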

You can write informally, for instance:

“For problems of fixed size, we observed diminishing returns beyond 8 nodes. Profiling showed that the serial setup phase and communication-heavy halo exchanges dominated at higher core counts, consistent with the idea that the non-parallel parts limit scalability.”

When analyzing weak scaling, mention whether runtime per problem unit stays roughly constant as you increase both the problem size and the number of resources. Connect deviations from ideal behavior to specific features of your code, such as growth in communication volume or metadata overhead.
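
One convenient way to quantify this is weak scaling efficiency, with $T(1)$ measured on one processing element at the base problem size and $T(P)$ on $P$ processing elements with the problem scaled up by a factor of $P$:

$$E_{\text{weak}}(P) = \frac{T(1)}{T(P)}.$$

Values close to 1 indicate that the cost per problem unit stays roughly constant as you scale out.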

Including a brief link between your numbers and the scaling ideas shows that you understand not only how to measure performance but also how to reason about it.

Common Pitfalls in Performance Reporting

Part of writing a useful report is avoiding common mistakes that weaken its conclusions. While you do not need a separate section on mistakes, be aware of these issues as you write:

Do not rely on single runs for timing if variability is significant. Average or otherwise summarize multiple runs.

Do not mix optimizations without tracking them. If you change many things at once, you cannot attribute improvements to specific actions.

Do not change problem size between baseline and optimized runs without explanation. If you must change size, emphasize that results are not directly comparable.

Do not overclaim. For example, avoid stating that an optimization is “optimal” or that it “fully eliminates communication overhead” unless you have very strong evidence.

Do not ignore correctness. Every optimization must be checked for accuracy.

You can briefly mention in your report if you encountered any such pitfalls and how you addressed them, which shows critical thinking.

Reproducibility Checklist

An essential part of an HPC performance report is making your work reproducible by others. Conclude with a concise reproducibility section that collects practical information spread throughout the report.

Include:

Repository or archive location of the code, if applicable, and the specific branch or commit used for the final results.

Brief build instructions, including compiler, important flags, and any required modules.

A list of job scripts or command lines used to run key experiments. You can include one or two representative scripts as code blocks.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:10:00
module load gcc/12.2 openmpi/4.1
srun ./solver -nx 512 -ny 512 -nz 256 -steps 100

A concise description of input parameters for each major result table or figure.

Any environment variables or settings that significantly affect performance, for example variables controlling thread count and affinity (such as OMP_NUM_THREADS or OMP_PROC_BIND) or GPU selection (such as CUDA_VISIBLE_DEVICES).

You do not need to reproduce every intermediate experiment, but someone following this checklist should be able to reproduce the main baseline and final optimized results on the same system.

A performance report is not complete without enough information for another person to rebuild the code and rerun at least the key experiments.

Reflecting on the Optimization Process

End your report with a short reflection on the process. This is not a general conclusion about HPC, but a specific summary of what you learned while analyzing and optimizing your application. For example, you might mention that:

Initial guesses about the bottleneck were incorrect until you used a profiler.

A simple algorithmic change had more impact than low-level code tweaks.

Communication or I/O became the limiting factor after compute optimizations.

This reflection shows that you can connect technical steps with broader insight about performance engineering. It also prepares you for future projects where performance analysis and optimization will be an ongoing, structured activity, rather than a one-time task.
