Purpose of the Performance Analysis and Optimization Report
For the final project, you are not only running an HPC application—you are also expected to analyze its performance and document how you improved it. The performance analysis and optimization report is where you:
- Show that you can measure performance in a structured way.
- Demonstrate that you can interpret profiling/benchmarking data.
- Explain optimization choices and their quantitative impact.
- Reflect on limitations, trade-offs, and future work.
Think of it as a mini research paper on your code’s performance, not just a log of what you tried.
Expected Structure of the Report
Use a clear, logical structure. A common, acceptable outline is:
- Abstract
- Introduction and Goals
- Experimental Setup
- Baseline Performance
- Performance Analysis
- Optimization Strategies
- Results and Comparisons
- Discussion and Lessons Learned
- Conclusions and Future Work
- Reproducibility Appendix (commands, scripts, module lists)
Below is guidance on what to include in each section.
Abstract
A short summary (5–10 sentences) covering:
- What application you studied (e.g., 2D heat equation, N-body simulation, matrix multiplication).
- What kind of parallelization is used (e.g., OpenMP, MPI, GPU, hybrid).
- Key performance metrics you measured (e.g., runtime, speedup, efficiency).
- Main optimizations you implemented.
- Headline results (e.g., “Runtime reduced by 3.4× on 32 cores; parallel efficiency improved from 42% to 73%”).
Write the abstract last, when you know your final results.
Introduction and Goals
Clarify what you are trying to achieve from a performance standpoint, not just a scientific one.
Include:
- A very brief description of the problem and code (1–2 paragraphs max).
- The main performance questions you want to answer, e.g.:
- How does runtime scale with number of cores?
- What limits the speedup?
- Is performance bound by computation, memory, or communication?
- The optimization goals, such as:
- Reduce time-to-solution for a fixed problem size.
- Achieve good strong scaling up to N cores.
- Improve GPU utilization.
Make the goals concrete and, where possible, quantifiable (e.g., “aim to double performance on 16 cores versus the baseline version”).
Experimental Setup
The point of this section is reproducibility and context. You are answering: “On what system, and with what configuration, were these numbers obtained?”
Include:
- Hardware description
- Node type and key specs: CPU model, cores per node, memory; GPU model if used.
- Interconnect type if relevant (e.g., InfiniBand vs Ethernet).
- Software environment
- Compiler and version, relevant flags.
- MPI, OpenMP, CUDA, or other runtime/library versions.
- Module names (if your cluster uses modules).
- Problem configuration
- Problem size(s) you use for performance tests.
- Any relevant algorithmic settings (e.g., number of iterations, convergence criteria).
- Job configuration
- Number of nodes, tasks per node, threads per task, GPUs per node.
- Any important scheduler or runtime parameters (e.g., binding, --exclusive, OMP_NUM_THREADS, CUDA_VISIBLE_DEVICES).
Prefer concrete detail over vague descriptions. Example:
- “All runs used 1–4 CPU-only nodes, each with 2 × 16-core Intel Xeon Gold 6130 (2.1 GHz), 192 GB RAM, connected via EDR InfiniBand. Code compiled with GCC 12.2, -O3 -march=native -fopenmp.”
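A complementary habit is to have the code itself print the configuration it actually ran with, so the numbers in your report can be matched to a log afterwards. Below is a minimal sketch for a hybrid MPI + OpenMP build (the logging format is just an example; adapt it to your own code):

```c
/* config_banner.c: print the job configuration actually seen at runtime.
 * Minimal sketch for a hybrid MPI + OpenMP code; adapt to your own logging. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size, name_len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &name_len);

    if (rank == 0) {
        const char *omp_env = getenv("OMP_NUM_THREADS");
        printf("MPI ranks:      %d\n", size);
        printf("OpenMP threads: %d (OMP_NUM_THREADS=%s)\n",
               omp_get_max_threads(), omp_env ? omp_env : "unset");
        printf("Rank 0 host:    %s\n", host);
    }

    /* ... application code ... */

    MPI_Finalize();
    return 0;
}
```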
Baseline Performance
Define a clear baseline version of your code. This is your reference point for all optimizations.
Explain:
- What the baseline implementation is (e.g., naive serial version, first parallel version without optimizations).
- Any known limitations or obvious inefficiencies (but do not fix them yet—this is the starting point).
- How you measured baseline performance:
- Which metric (e.g., wall-clock time, iterations per second).
- How many repetitions per configuration to reduce noise.
- How you handled initialization, I/O, and warm-up.
Present at least:
- Runtime for one or two representative problem sizes.
- If relevant, baseline strong-scaling or weak-scaling data on a small range of cores.
Tables or simple plots are appropriate. For example:
- A table with runtime vs. core count for a fixed problem size.
- A plot showing runtime on log–log axes or speedup vs. number of cores.
Be clear about units (seconds vs milliseconds) and what is included (e.g., “reported times exclude file I/O”).
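The sketch below shows one way to collect such baseline timings: repeat the measurement of the compute phase only, exclude setup and I/O, and take the slowest rank as the time for each repetition. The setup() and solve() routines are stand-ins for your own code, and the repetition count, warm-up handling, and reduction choice are exactly the kinds of decisions you should state in this section:

```c
/* baseline_timing.c: sketch of repeated wall-clock measurements of the
 * compute phase only (initialization and I/O excluded), for an MPI code. */
#include <mpi.h>
#include <stdio.h>

/* Stand-ins for your application's phases: setup() is not timed,
 * solve() is the compute phase whose time you report. */
static void setup(void) { /* allocation, input, initialization */ }
static void solve(void) {
    volatile double s = 0.0;
    for (long i = 0; i < 10000000L; ++i) s += 1.0 / (double)(i + 1);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    setup();                          /* excluded from reported time */
    solve();                          /* warm-up run, not reported   */

    const int reps = 5;               /* repetitions to reduce noise */
    for (int r = 0; r < reps; ++r) {
        MPI_Barrier(MPI_COMM_WORLD);  /* start all ranks together    */
        double t0 = MPI_Wtime();
        solve();
        double local = MPI_Wtime() - t0;

        double t_max;                 /* slowest rank defines the run */
        MPI_Reduce(&local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("rep %d: %.3f s\n", r, t_max);
    }

    MPI_Finalize();
    return 0;
}
```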
Performance Analysis
Here you dig into why the baseline performs as it does. You will use tools and metrics introduced earlier in the course, but this section focuses on:
- Presenting the measurements.
- Interpreting them.
- Connecting them to code behavior.
Typical subsections:
Timing and Scaling Metrics
Include key high-level metrics such as:
- Wall-clock time $T(p)$ vs. number of processes/threads $p$.
- Speedup: $S(p) = \dfrac{T(1)}{T(p)}$.
- Parallel efficiency: $E(p) = \dfrac{S(p)}{p}$.
Show and comment on:
- Where speedup starts to flatten.
- Where efficiency drops below some threshold (e.g., 50%) and why you suspect that happens (load imbalance, communication overhead, memory bandwidth limits, etc.).
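As a quick illustration of how these metrics relate (using made-up numbers, not measurements): suppose $T(1) = 480$ s and $T(16) = 40$ s. Then

$$
S(16) = \frac{T(1)}{T(16)} = \frac{480}{40} = 12, \qquad
E(16) = \frac{S(16)}{16} = \frac{12}{16} = 0.75 = 75\%.
$$

In the report, the interesting part is not the numbers themselves but your explanation of where the missing efficiency (25% in this example) goes.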
Profiling and Hotspots
Summarize what your profiling tools indicated:
- Which functions or code regions dominate runtime (e.g., “90% in matvec()”).
- Whether the code is compute-bound or memory-bound, based on indicators like:
- CPU utilization.
- Cache miss rates.
- Memory bandwidth usage.
- For GPU codes:
- Kernel execution time vs. data transfer time.
- Achieved occupancy or achieved memory bandwidth.
Do not reproduce entire profiler outputs—pick the most relevant numbers, tables, or plots.
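A rough back-of-the-envelope check of the memory-bandwidth indicator can already be informative (the numbers below are illustrative, not measurements): if one iteration of a kernel moves about 12 GB of data and takes 0.08 s, the achieved bandwidth is

$$
\frac{12\ \text{GB}}{0.08\ \text{s}} = 150\ \text{GB/s},
$$

which you can compare against the node's peak memory bandwidth (from vendor specifications or a STREAM-type benchmark). A kernel running close to peak bandwidth while leaving the cores underutilized is a strong sign that it is memory-bound.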
Parallel Overheads and Bottlenecks
Describe any performance issues you discovered, for example:
- Load imbalance: some processes/threads sit idle or wait in synchronization while others still have work.
- Communication overhead: large fraction of time in MPI calls or synchronization points.
- Serial sections: parts of the code that cannot be parallelized and therefore limit scaling.
- Poor data locality or cache usage.
Support claims with evidence such as:
- Time spent in MPI or synchronization primitives.
- Large variation in per-rank timings.
- Profiling metrics that point to memory or communication bottlenecks.
The goal is to identify a small number of primary bottlenecks to target with optimization.
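A cheap way to gather such evidence for load imbalance is to compare per-rank timings of the main compute region. The sketch below (with dummy, deliberately uneven work standing in for your application) reports the minimum, average, and maximum per-rank time; a large max/avg ratio means some ranks spend a significant fraction of the run waiting:

```c
/* imbalance_check.c: compare per-rank timings of a compute region to
 * expose load imbalance (large max/avg ratio => some ranks wait). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    /* The compute region of your application goes here; dummy work of
     * deliberately uneven size is used as a stand-in. */
    volatile double s = 0.0;
    for (long i = 0; i < 1000000L * (rank + 1); ++i) s += 1.0;
    double local = MPI_Wtime() - t0;

    double t_min, t_max, t_sum;
    MPI_Reduce(&local, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &t_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("compute region: min %.3f s, avg %.3f s, max %.3f s "
               "(max/avg = %.2f)\n",
               t_min, t_sum / size, t_max, t_max / (t_sum / size));

    MPI_Finalize();
    return 0;
}
```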
Optimization Strategies
Explain the changes you made to improve performance. For each optimization, describe:
- Motivation
- Which bottleneck it targets.
- Why you expected it to help.
- Implementation
- A brief technical description of the change.
- Code snippets or pseudocode where it clarifies things.
- Any modifications to algorithms, data structures, or parallelization strategy.
- Potential trade-offs
- Increased code complexity.
- Reduced generality or flexibility.
- Changes in memory usage.
- Possible impact on numerical behavior (if applicable).
Organize this section as a sequence of clearly labeled optimizations, e.g.:
- Optimization 1: Improve data locality by changing array layout.
- Optimization 2: Reduce synchronization by restructuring loops.
- Optimization 3: Overlap communication and computation.
- Optimization 4: GPU kernel tuning (block size, memory access pattern).
Do not just list them—connect each optimization to the earlier analysis, showing a chain of reasoning from measurements → hypothesis → change.
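As an illustration of the level of detail that is useful in the implementation description, here is a sketch of what Optimization 3 (overlapping communication and computation) could look like in an MPI halo-exchange code. The update_interior() and update_boundary() routines and the buffer names are placeholders, not code from any particular application:

```c
/* overlap.c: sketch of overlapping halo exchange with interior computation
 * using nonblocking MPI. The update routines are placeholders. */
#include <mpi.h>

void update_interior(void) { /* stencil update on points that need no halo */ }
void update_boundary(void) { /* stencil update on points next to the halo  */ }

void exchange_and_compute(double *send_lo, double *send_hi,
                          double *recv_lo, double *recv_hi,
                          int n, int lower, int upper, MPI_Comm comm)
{
    MPI_Request req[4];

    /* 1. Post nonblocking receives and sends for the halo regions. */
    MPI_Irecv(recv_lo, n, MPI_DOUBLE, lower, 0, comm, &req[0]);
    MPI_Irecv(recv_hi, n, MPI_DOUBLE, upper, 1, comm, &req[1]);
    MPI_Isend(send_lo, n, MPI_DOUBLE, lower, 1, comm, &req[2]);
    MPI_Isend(send_hi, n, MPI_DOUBLE, upper, 0, comm, &req[3]);

    /* 2. Compute on interior points while the messages are in flight. */
    update_interior();

    /* 3. Wait for the halos, then compute the boundary points that need them. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    update_boundary();
}
```

In the report, a snippet of roughly this size, plus a sentence on why the interior update is large enough to hide the messages, is usually more informative than pages of diff output.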
Results and Comparisons
Now you quantify the impact of your optimizations. This section should be data-rich but focused.
For each important configuration, present:
- Baseline vs. optimized runtimes.
- Speedup of optimized version over baseline: $S_{\text{opt}} = \dfrac{T_{\text{baseline}}}{T_{\text{optimized}}}$.
- Updates to strong-scaling or weak-scaling curves.
- Changes in parallel efficiency.
Typical presentations:
- Plots of runtime vs. cores for baseline and optimized versions on the same graph.
- Bar charts comparing breakdowns of time spent in major routines before and after optimization.
- Tables showing selected system metrics (e.g., memory bandwidth usage, GPU utilization) before and after.
For each figure or table, always add a short interpretation, e.g.:
- “After optimizing communication, runtime on 32 cores decreased from 120 s to 45 s (2.67× speedup). Parallel efficiency at 32 cores improved from 31% to 82%.”
- “Cache miss rate dropped from 18% to 7%, which correlates with the improved performance of the inner compute kernel.”
Emphasize trends, not just raw numbers.
Discussion and Lessons Learned
Reflect on what the results tell you about your code and the system.
Discuss:
- Which optimizations were most effective and why.
- Which attempts did not help (or even made things worse) and what you learned from that.
- The role of hardware characteristics (e.g., memory bandwidth, interconnect latency, GPU architecture) in your results.
- Any limits you encountered:
- Saturated memory bandwidth.
- Communication-dominated scaling at high core counts.
- Diminishing returns from further tuning.
This section shows critical thinking: you are not just reporting success, you are analyzing outcomes and trade-offs.
Conclusions and Future Work
Provide a concise summary:
- Restate the main performance achievements, with actual numbers.
- Highlight the key bottlenecks that remain (if any).
- Suggest concrete next steps that could be explored if you had more time, such as:
- Alternative algorithms with better complexity or locality.
- More advanced communication patterns.
- Additional GPU optimization or mixed-precision techniques.
- Better load-balancing strategies.
Keep this section short but specific.
Reproducibility Appendix
Add an appendix with everything needed to reproduce your main results. At minimum include:
- Build instructions
- Compile commands (or build system targets).
- Important compiler flags.
- Run commands
- Example job scripts.
- Command lines used to launch the code (e.g., srun, mpirun).
- Environment variables (e.g., OMP_NUM_THREADS, KMP_AFFINITY).
- Module and environment info
- Modules loaded (module list).
- Any relevant environment module files or container recipes.
- Input data
- How to generate or obtain input data.
- Parameters used for the reported experiments.
This section is crucial for credibility and is often what separates a good report from an excellent one.
Practical Tips for Writing the Report
- Collect data systematically
- Automate runs when possible (scripts, simple Make/CMake targets).
- Keep a log of each experiment (date, system, configuration, notes).
- Control variability
- Run multiple times and average when runtimes are short.
- Avoid heavy load on shared login nodes; use compute nodes.
- Note anomalies instead of silently ignoring them.
- Visualize wisely
- Use clear axis labels, legends, and units.
- Do not overload plots; separate figures for different experiments.
- Stay honest
- It is acceptable if some optimizations did not work or if scaling is limited—as long as you analyze and explain.
- Do not cherry-pick only the best runs; report typical behavior.
Evaluation Criteria
Your performance analysis and optimization report will typically be assessed on:
- Clarity and structure: Is the report logically organized and easy to follow?
- Technical correctness: Are metrics, formulas, and interpretations used appropriately?
- Depth of analysis: Did you identify meaningful bottlenecks and connect them to hardware and code behavior?
- Effectiveness of optimizations: Did you achieve measurable improvements, or at least rigorously explore why some attempts failed?
- Reproducibility: Could someone else reproduce your main results from the information given?
Use this chapter as a checklist while you work on your final project and as you write up your findings.