20.1 Designing an HPC application

From Idea to HPC Application

Designing an HPC application is about turning a scientific or engineering question into a program that runs efficiently on a parallel machine. In the context of this course, “designing” does not only mean writing code. It covers choosing the right algorithms and decomposition, planning how the code will use the cluster, and structuring the project so that it can be tested, profiled, and improved.

This chapter focuses on the design process you will follow for the final project and similar real projects. It assumes that you already know the basic concepts of parallelism, MPI, OpenMP, GPUs, job submission, and performance analysis from earlier chapters. Here, you will connect those pieces into a coherent application design process.

Understanding the Problem and Setting Goals

Before thinking about MPI ranks or threads, you must define the problem clearly in computational terms. Start by translating the domain question into operations on data. For example, instead of “simulate the flow in a pipe,” think “solve the 3D incompressible Navier–Stokes equations on a structured grid for many time steps.”

Identify the input, output, and what “success” means. Is the goal faster time to solution, the ability to run a bigger model, higher accuracy, or cheaper runs? State constraints such as maximum wall-clock time, memory per core, and acceptable accuracy.

It is useful to sketch, informally, what a single-core serial version would do. Identify the core loop or kernel that dominates the work. Often 90 percent of the runtime is spent in a small part of the code. Your HPC design will focus on that hotspot.

When defining goals for an HPC application, always be explicit about:

  1. The primary objective, for example, minimize time to solution or maximize problem size for a fixed time.
  2. The main computational kernel, that is, the part of the code that will consume most CPU or GPU time.
  3. Constraints, such as memory per node, total wall-clock limit, and required accuracy.

Vague goals lead to vague designs and poor performance.

Design decisions will follow from this problem definition. A bandwidth-bound stencil code looks very different from a compute-bound dense linear algebra application.

Choosing the Right Parallelization Strategy

Once the problem is clear, you must decide how to exploit parallelism on the target system. At the design stage you do not write code yet. Instead, you choose patterns and levels of parallelism that fit the problem structure and the hardware.

First, decide the main style of parallelism. Data parallelism is often natural when your data can be split into chunks that are processed independently, such as grid partitions or particle subsets. Task parallelism is more suitable when you have multiple different operations or pipelines that can run concurrently, such as independent simulations with varying parameters.

Second, decide at which levels of the machine hierarchy you will use which model. Typical combinations include MPI across nodes with OpenMP threads inside nodes, or MPI plus CUDA for GPUs. Try to match models to hardware. Memory shared within a node suggests threads or shared-memory programming, while separate memories across nodes suggest message passing.
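
To make this concrete, the sketch below shows one minimal hybrid skeleton, assuming MPI ranks across nodes and OpenMP threads within each node; it is a starting point for the design stage, not a prescribed structure.

```c
/* Minimal hybrid MPI + OpenMP skeleton: MPI ranks across nodes,
 * OpenMP threads inside each rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* MPI_THREAD_FUNNELED is enough if only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        #pragma omp single
        printf("rank %d of %d runs %d threads\n",
               rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```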

Finally, think about how far you realistically need to scale. Designing for hundreds of thousands of cores is very different from designing for a small departmental cluster. Your final project should aim for a scale that the cluster and job limits can support.

Decomposing Data and Work

The central design problem for an HPC application is decomposition. You must decide how to split data and computation among processes and threads. This is where you connect your mathematical or algorithmic description to parallel structure.

Start by listing the main data structures, such as arrays, meshes, graphs, or particle lists, and the main loops that operate on them. Then decide how to partition these data structures across ranks or threads. You typically choose between domain decomposition, where you split the simulated space or index range, functional decomposition, where you give different tasks to different ranks, or a combination.

For structured grids in PDEs, a common design is to partition the grid into blocks assigned to MPI ranks. Each rank holds a local subdomain with halo or ghost cells for neighboring data. Threads then work within the subdomain. For particle simulations, you can decompose by spatial cells containing particles or by assigning particles directly to ranks.
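
A minimal sketch of such a setup, assuming a 2D structured grid of NX by NY cells, a near-square process grid chosen by MPI, and a one-cell ghost layer, could look like this:

```c
/* Sketch: block-partition a 2D grid over the ranks using a Cartesian
 * communicator; each rank allocates its block plus one ghost layer.
 * NX, NY and the single field u are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

#define NX 1024
#define NY 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size, rank, dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 2, dims);          /* pick a near-square process grid */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);

    /* Block sizes; the first NX % dims[0] rows of ranks get one extra row. */
    int nx_local = NX / dims[0] + (coords[0] < NX % dims[0] ? 1 : 0);
    int ny_local = NY / dims[1] + (coords[1] < NY % dims[1] ? 1 : 0);

    /* Local block plus a one-cell ghost layer on every side. */
    double *u = malloc((size_t)(nx_local + 2) * (ny_local + 2) * sizeof(double));

    /* ... computation would go here ... */

    free(u);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```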

Decomposition quality directly affects load balance, memory usage, and communication. You want each unit of parallelism to have a similar amount of work and to minimize the amount of data that must cross between units. You also need to respect memory limits per node when deciding how big a local piece can be.

A good decomposition satisfies three key properties:

  1. Balance: Each processing unit performs approximately equal work.
  2. Locality: Most accesses are to data stored locally, not remote.
  3. Minimal communication: The surface area between subdomains, and therefore the data to exchange, is as small as possible.

Poor decomposition leads to idle processors and excessive communication, regardless of how clever the rest of the code is.

You should sketch your decomposition on paper before you code. Draw how data is laid out, how ranks are arranged, and which neighbors need to talk. These sketches often reveal problems early, for example, one rank owning a disproportionately expensive region.

Designing the Communication and Synchronization Pattern

Once you have decomposed data and work, you must define how and when parallel entities communicate and synchronize. This is one of the most important architectural decisions for an HPC application.

First, characterize the communication pattern. Is it mostly nearest neighbor, global reductions, irregular sparse exchanges, or all-to-all? The pattern informs your choice of MPI routines, collective operations, and any custom communication schedules. In hybrid or GPU-based codes, you must also decide which communication stays between CPUs and which requires moving data to or from accelerators.

Second, decide how often you communicate relative to computation. Many algorithms have a natural computation to communication ratio, for example a few compute steps followed by a halo exchange and a global reduction. Where possible, you want to increase the amount of local work per communication step and overlap communication with useful computation.

Third, consider synchronization costs. Barriers, implicit or explicit, can serialize your application if used too often or at the wrong places. Prefer localized synchronization, for example, per-neighbor message matching or fine-grained OpenMP synchronization, instead of global barriers whenever possible.
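
As an example of per-neighbor message matching, the sketch below exchanges a one-cell halo in a one-dimensional decomposition with nonblocking calls; the array layout (ghost cells at u[0] and u[n_local+1]) and the neighbor ranks left and right are assumptions, and boundary ranks can simply pass MPI_PROC_NULL.

```c
/* Sketch of a nearest-neighbour halo exchange in one dimension, using
 * nonblocking sends and receives so that interior work can overlap
 * the messages. */
#include <mpi.h>

void halo_exchange(double *u, int n_local, int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    /* Receive into the ghost cells u[0] and u[n_local+1],
       send the first and last owned cells u[1] and u[n_local]. */
    MPI_Irecv(&u[0],           1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&u[n_local + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&u[1],           1, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&u[n_local],     1, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* ... interior cells that do not depend on the halo could be updated here ... */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```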

For each main phase of your algorithm, you should be able to state:

  1. What data are communicated.
  2. Which parallel entities participate.
  3. The order in which messages are sent and received.
  4. Whether there are global coordination points.

If any of these are unclear, the implementation will likely develop deadlocks and performance problems.

Algorithmic Structure and Time Stepping

Many HPC applications are time integrators, iterative solvers, or repeated optimization loops. When designing such applications, concentrate on the structure of the main iteration. This structure dictates where you do communication, I/O, and performance-critical computation.

A generic pattern is a time stepping loop of the form
$$
\text{for } n = 0,1,\ldots,N-1:
$$
followed by a sequence such as “exchange boundary data, update interior, update boundaries, apply constraints, compute diagnostics, possibly output or checkpoint.”

Even if your project is not a time integrator, you often have an outer loop performing repeated sweeps, iterations, or passes. Design that loop on paper and annotate each step as local compute, communication, or I/O. This will guide you in placing MPI calls, OpenMP regions, and I/O operations.
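
A design-stage version of such an annotated loop might look like the sketch below; every function called in it is a hypothetical placeholder named after one phase, and the exact phases depend on your algorithm.

```c
/* Design-stage sketch of the main loop. Every function called here is a
 * hypothetical placeholder named after one phase of the algorithm. */
#include <mpi.h>

void   halo_exchange(double *u, int n_local, int left, int right, MPI_Comm comm);
void   update_interior(const double *u, double *u_new, int n_local);
void   update_boundaries(double *u_new, int n_local);
double residual_norm(const double *x_new, const double *x_old,
                     int n_local, MPI_Comm comm);
void   write_diagnostics(const double *u, int step, double r);

void main_loop(double *u, double *u_new, int n_local, int n_steps,
               int left, int right, int output_interval, MPI_Comm comm)
{
    for (int n = 0; n < n_steps; ++n) {
        halo_exchange(u, n_local, left, right, comm);      /* communication           */
        update_interior(u, u_new, n_local);                /* local compute           */
        update_boundaries(u_new, n_local);                 /* local compute           */
        double r = residual_norm(u_new, u, n_local, comm); /* compute + global reduce */
        if (n % output_interval == 0)
            write_diagnostics(u_new, n, r);                /* I/O, kept infrequent    */
        double *tmp = u; u = u_new; u_new = tmp;           /* swap buffers, no copy   */
    }
}
```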

You also need to identify convergence criteria or stopping conditions. These are often based on norms or reductions such as
$$
r = \sqrt{\sum_i (x_i^{(k+1)} - x_i^{(k)})^2}
$$
which require collective communication. Designing where and how often to compute these quantities is part of the application structure.
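
A minimal sketch of such a residual, assuming each rank owns n_local entries of the solution vector, is a local sum followed by a single collective:

```c
/* Sketch: each rank sums its local squared differences, then one
 * MPI_Allreduce combines the partial sums across all ranks. */
#include <math.h>
#include <mpi.h>

double residual_norm(const double *x_new, const double *x_old,
                     int n_local, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; ++i) {
        double d = x_new[i] - x_old[i];
        local += d * d;
    }
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return sqrt(global);
}
```

Because the reduction is a collective operation, deciding whether to evaluate it every iteration or only every few iterations is itself a design decision with a measurable cost.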

Finally, remember that the cost of every step inside your main loop is multiplied by the number of iterations. Design from the start to keep operations inside this loop as simple and cache friendly as possible.

Planning for Memory and Data Layout

Memory design is not only an optimization step. It is part of application design. On HPC systems, the memory per node and the memory bandwidth often limit problem size and performance. You must plan your data structures accordingly.

Start by estimating the memory footprint of your main data structures. For example, a 3D array of double precision values with dimensions $N_x \times N_y \times N_z$ uses approximately
$$
8 \times N_x N_y N_z \text{ bytes.}
$$
Multiply by the number of variables stored, including temporary arrays and ghost cells. Then divide by the number of MPI ranks to estimate memory per rank. This quick calculation often reveals whether a design is viable on the intended hardware.
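
As an illustrative calculation with assumed numbers, a single $1024^3$ array of doubles occupies
$$
8 \times 1024^3 \approx 8.6 \text{ GB},
$$
so five such variables need roughly 43 GB in total, or about 2.7 GB per rank when split across 16 ranks. If that exceeds the memory available per rank, the design must change before any code is written.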

Data layout should match your access patterns. If your inner loops iterate over the fastest varying index, make sure the array is stored in a compatible order. Poor layout causes more cache misses and prevents effective vectorization. For hybrid designs, you must also decide which data reside on CPUs and which on GPUs, and how data transfers are orchestrated.

It is good practice to keep data structures simple and contiguous in memory when possible. Deeply nested objects or irregular pointer-based structures are harder to parallelize efficiently and harder to move onto accelerators.

I/O, Checkpointing, and Output Design

Input and output are often afterthoughts in small exercises, but for a realistic HPC application they must be part of the design. Heavy or poorly placed I/O can dominate runtime at scale.

First, identify what input your application requires. Large binary files, configuration parameters, or restart states all need a reading strategy that avoids bottlenecks at startup. You may choose a single rank to read and then broadcast parameters, while large input fields often require parallel I/O.
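
One common arrangement for the small configuration data is sketched below, assuming a plain struct of built-in types and a homogeneous cluster; the struct fields and the file format are illustrative only.

```c
/* Sketch: rank 0 parses the configuration, then broadcasts the small
 * parameter struct to all ranks. */
#include <mpi.h>

typedef struct {
    int    nx, ny, n_steps;
    double dt;
} Params;

void read_and_broadcast_params(Params *p, const char *filename, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        /* ... parse `filename` and fill *p on rank 0 only ... */
    }
    /* Broadcasting raw bytes is adequate for a plain struct of built-in
       types when all ranks run the same executable on the same hardware. */
    MPI_Bcast(p, sizeof(Params), MPI_BYTE, 0, comm);
}
```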

Second, design what output you need and how often. A tension exists between observing your simulation and keeping it fast. Dumping full fields every time step might be infeasible. Consider thinning in time, compressing data, or writing derived quantities instead of full states.

Third, include a checkpointing strategy in your design. For long runs on shared clusters, you must plan for the fact that jobs may be interrupted or hit time limits. Checkpoints should contain enough information to resume and should be written at intervals that balance restart safety with I/O overhead.

An effective I/O and checkpointing design follows these rules:

  1. Keep I/O outside the hottest loops whenever possible.
  2. Write the minimum data required for analysis and restart.
  3. Use parallel I/O or collective operations when volumes are large.
  4. Set a checkpoint interval that keeps the expected lost work acceptable if a failure occurs.

Ignoring I/O and checkpoints during design often leads to unusable or fragile applications.

These choices are particularly important in a course project where job time limits may force you to restart longer experiments.
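
For choosing the checkpoint interval itself, a widely used first-order estimate is Young's approximation,
$$
\tau_{\text{opt}} \approx \sqrt{2\,\delta\,M},
$$
where $\delta$ is the time to write one checkpoint and $M$ is the mean time between interruptions. Treat it as a rough starting point rather than a prescription; on a course cluster, the job time limit is often the dominant interruption to plan around.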

Incremental Development and Prototyping

Even with a good design, writing a full-scale HPC application in one step is risky. A better approach is incremental development, where you start simple and gradually add complexity while preserving correctness.

A common pattern is to begin with a serial prototype or a minimal parallel version on a small problem. This prototype implements the core algorithm without sophisticated optimization or all intended features. It serves as a reference for correctness and an anchor for later performance comparisons.

Next, add parallelization in stages. For example, first introduce MPI with a simple domain partition and ensure correctness on a few ranks. Then introduce threading or GPU kernels inside each domain. After this, you can refine the decomposition or improve load balancing.

Throughout this process, preserve a version that you trust as a reference. After each significant change, compare results on small inputs against the reference. Differences can reveal bugs early, before they are buried in a large code base.

Design your application from the start so that it is testable at different sizes and on different numbers of processors. This often means parameterizing the grid size, number of particles, and iteration count via configuration rather than hardcoding them.

Designing for Measurement and Optimization

Performance is a primary concern in HPC, so you should treat performance measurement as a design requirement, not a late-stage patch. From the beginning, plan where you will insert timers, what metrics you will collect, and how you will run scaling experiments.

A practical approach is to define several named regions in your design corresponding to phases like initialization, computation, communication, and I/O. You can then instrument these regions with timers later. The phase boundaries should be clear in your design to make this feasible.
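
A lightweight way to instrument these regions, sketched below under the assumption that a few global accumulators are acceptable, is to wrap each phase in MPI_Wtime calls and reduce the totals at the end of the run; the region names and report format are illustrative.

```c
/* Sketch of region timing with MPI_Wtime, reporting the maximum over ranks. */
#include <mpi.h>
#include <stdio.h>

double t_compute = 0.0, t_comm = 0.0, t_io = 0.0;

/* Usage around a phase:
       double t0 = MPI_Wtime();
       ... computation phase ...
       t_compute += MPI_Wtime() - t0;
*/

void report_times(MPI_Comm comm)
{
    double local[3] = { t_compute, t_comm, t_io }, max[3];
    int rank;

    MPI_Reduce(local, max, 3, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        printf("compute %.3f s  comm %.3f s  io %.3f s (max over ranks)\n",
               max[0], max[1], max[2]);
}
```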

You should also anticipate which scaling questions you want to explore. For example, you might design your application so that you can vary the global problem size $N$ and the number of processes $P$, to test strong and weak scaling without changing source code. This often means using mathematical formulas for local sizes, such as
$$
N_{\text{local}} = \frac{N}{P}
$$
for a one-dimensional decomposition, and generalizations for multi-dimensional splits.
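
In practice $N$ rarely divides evenly by $P$, so the design should also say where the remainder goes; one common convention, shown as a minimal sketch, gives the first $N \bmod P$ ranks one extra element.

```c
/* Sketch: 1D block distribution in which the first N % P ranks
 * receive one extra element when N is not divisible by P. */
int local_size(int N, int P, int rank)
{
    return N / P + (rank < N % P ? 1 : 0);
}
```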

When you define key operations in the design, consider how they would appear in a performance profile. A dense matrix multiply, a sparse gather, or a global reduction each has well-known performance behavior. Understanding their role helps you prioritize optimization efforts in the project phase.

Software Structure and Modularity

HPC applications can quickly become unmanageable if all logic lives in one large file or function. A good design allocates responsibilities into modules or components, even for relatively small course projects.

Useful separations include a module for problem setup and parameter parsing, another for data structure allocation and initialization, one for the core computational kernels, and one for I/O and checkpointing. If you use hybrid parallelism or accelerators, it can help to separate CPU-only and GPU-specific code paths behind a common interface.

You should define clear function boundaries around operations like “perform one iteration step” or “exchange halo regions.” These boundaries make it easier to change implementation details later, for example, replacing a simple blocking communication with a nonblocking version, without rewriting the entire code.
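
For instance, hiding the exchange behind a small interface like the hypothetical header below lets you later split a blocking exchange into start and finish phases, enabling overlap, without touching any caller.

```c
/* halo.h -- hypothetical module interface isolating communication details. */
#ifndef HALO_H
#define HALO_H

#include <mpi.h>

/* Post the nonblocking sends and receives for the ghost layers. */
void halo_exchange_start(double *u, int n_local, int left, int right,
                         MPI_Comm comm, MPI_Request req[4]);

/* Complete the exchange; interior work can run between start and finish. */
void halo_exchange_finish(MPI_Request req[4]);

#endif /* HALO_H */
```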

Avoid scattering low-level details, such as MPI communicator calls, throughout the code, where they become hard to track. Instead, centralize these concerns in a small number of places with well-defined responsibilities.

Planning Validation and Verification

Designing an HPC application also includes planning how you will show that it is correct and scientifically meaningful. Verification addresses whether the code solves the equations or algorithm correctly, while validation addresses whether the model represents reality appropriately. For a course project, you will mostly focus on verification.

You should design test cases that are small enough to run quickly and simple enough to have known or at least reference solutions. Examples include running on a coarse grid for which you can compute an analytic solution, or comparing against a high-precision or serial version on the same input.

In your design, identify where to compute errors or residuals. For example, you might define a discrete norm
$$
\|e\|_2 = \sqrt{\sum_i (u_i - u_i^\text{ref})^2}
$$
between your solution and a reference. Deciding how and when you compute such measures influences communication and performance, because these norms often require global reductions.

Finally, be aware that floating point arithmetic in parallel codes can lead to small differences between runs due to different reduction orders. Design your verification strategy with tolerances rather than exact equality, and be prepared to justify acceptable error thresholds in your project report.
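
A verification helper in this spirit might look like the sketch below, where the tolerance is an explicit, documented assumption rather than an implicit demand for exact equality.

```c
/* Sketch of a tolerance-based check against a reference solution;
 * the tolerance value is an assumption to be justified per problem. */
#include <math.h>
#include <stdio.h>

int check_against_reference(const double *u, const double *u_ref,
                            int n, double tol)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double e = u[i] - u_ref[i];
        sum += e * e;
    }
    double err = sqrt(sum);
    if (err > tol) {
        fprintf(stderr, "verification failed: ||e||_2 = %g > %g\n", err, tol);
        return 0;
    }
    return 1;
}
```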

Adapting the Design to the Final Project

Within this course, your final project offers a concrete context for applying these design principles. At the start of your project, you should produce a short design document that covers the key points:

  1. State the problem and objectives in computational terms.
  2. Describe your chosen parallelization model and decomposition strategy.
  3. Outline the main algorithmic structure and iteration, including where communication and I/O occur.
  4. Give rough estimates of memory requirements and expected scaling ranges.
  5. Explain how you plan to validate results and measure performance.

Treat this design as a living document. As you test and profile your implementation, you may revise aspects of the decomposition, communication pattern, or I/O frequency. However, every change should be consistent with the core problem goals you defined at the very beginning.

An HPC application that is thoughtfully designed is far easier to implement, debug, profile, and improve. By following the process in this chapter, you will not only complete the final project more effectively, but you will also learn a design workflow that applies to real HPC codes beyond this course.
