From Idea to HPC Application
Designing an HPC application is about systematically turning a scientific or engineering problem into a scalable, testable, and maintainable parallel program that runs effectively on real systems. This chapter focuses on the practical design process and decisions you must make before and during implementation.
Clarifying the Problem and Goals
Before thinking about MPI, OpenMP, GPUs, or clusters, define:
- Scientific/engineering question: What do you want to compute, predict, or analyze?
- Inputs and outputs:
- What data formats, sizes, and units?
- What exactly should the program produce?
- Accuracy and fidelity:
- Required precision (e.g., single vs double precision).
- Acceptable numerical error or tolerance.
- Performance targets:
- Problem sizes you must handle (e.g., $10^6$ vs $10^{10}$ data points).
- Time-to-solution constraints (hours vs days).
- Resource budget: maximum cores/GPUs, memory, and storage you can realistically use.
Write these down as a problem specification. It will drive all subsequent design choices and provides a reference for testing and performance evaluation.
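One lightweight way to make the specification concrete is a small structured record. The sketch below is illustrative only: the fields and values describe a hypothetical 2D heat-equation solver, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProblemSpec:
    """Machine-readable problem specification (illustrative fields)."""
    question: str              # what the program computes
    grid_points: int           # target problem size
    precision: str             # "single" or "double"
    tolerance: float           # acceptable numerical error
    max_walltime_hours: float  # time-to-solution constraint
    max_nodes: int             # resource budget

spec = ProblemSpec(
    question="steady-state temperature on a 2D plate",
    grid_points=10**6,
    precision="double",
    tolerance=1e-8,
    max_walltime_hours=2.0,
    max_nodes=4,
)
print(spec)
```

Because the record is frozen, the specification cannot drift silently during development; changes to it are deliberate and visible in version control.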
Choosing a Parallelization Strategy
You rarely start by coding; you first decide how the work should be parallelized conceptually.
Identify Core Computations
Break the problem into its main computational kernels, for example:
- Time-stepping loop in a PDE solver.
- Matrix–vector products in a linear solver.
- FFT-based convolution in signal processing.
- Large ensemble of independent simulations.
For each kernel, characterize:
- Approximate complexity (e.g., $O(N)$, $O(N^2)$, $O(N^3)$).
- Data structures (arrays, sparse matrices, particles, graphs).
- Data dependencies (neighbor access, global reductions, long-range interactions).
Map to Parallelism Types
Based on your kernels, decide which style of parallelism dominates:
- Task parallelism: Many independent or loosely coupled tasks (e.g., parameter sweeps, Monte Carlo runs).
- Data parallelism: The same operation applied to many data elements (e.g., grid points, particles, matrix rows).
- Hybrid: Task-level parallelism across nodes, data parallelism within each node.
Consider early whether the computation is better suited to:
- Shared-memory parallelism (e.g., OpenMP) within a node.
- Distributed-memory parallelism (e.g., MPI) across nodes.
- Accelerator-based parallelism (e.g., CUDA, OpenACC) on GPUs.
- Hybrid MPI+X (OpenMP, CUDA, etc.) for multi-node, multi-core, multi-GPU systems.
Your choice should match:
- Available hardware on your target cluster.
- Team skills and learning goals.
- Long-term maintainability of the code.
Designing Data Decomposition
Once you know the parallelism style, you must decide how to split your data.
Domain Decomposition
For grid- or mesh-based problems (e.g., simulations on a 2D/3D space):
- Decompose the physical domain into subdomains:
- 1D slab decomposition (split along one axis).
- 2D or 3D block decomposition (tiles or blocks).
- Each process or thread owns one subdomain.
- Consider:
- Surface-to-volume ratio: Minimizing communication (halo exchange size) relative to computation.
- Load balance: Equal number of grid points or equal computational work per process.
- Boundary conditions: How to handle physical boundaries and interfaces between subdomains.
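The surface-to-volume trade-off can be made concrete with a quick estimate. The sketch below compares per-process halo sizes for a 1D slab versus a 2D block split of an n × n grid; the formulas assume interior subdomains and a perfect-square process count for the block case.

```python
import math

def halo_points_slab(n, p):
    """Halo size per process for a 1D slab split of an n x n grid among
    p processes: each interior slab exchanges one full row of n points
    with each of its two neighbors."""
    return 2 * n

def halo_points_block(n, p):
    """Halo size per process for a 2D block split (assumes p is a perfect
    square and sqrt(p) divides n): each interior block exchanges one
    row/column of n/sqrt(p) points with each of its four neighbors."""
    side = n // math.isqrt(p)
    return 4 * side

n, p = 1024, 16
print(halo_points_slab(n, p))   # 2048
print(halo_points_block(n, p))  # 1024
```

For fixed n the slab halo stays constant while the block halo shrinks as 1/sqrt(p), which is why higher-dimensional decompositions usually scale further.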
Data Partitioning for Collections
For collections like particles, matrices, or graphs:
- Particles:
- Spatial decomposition (cells in space).
- Particle-based decomposition (equal number of particles per process).
- Dense matrices:
- Row-wise, column-wise, or block (2D) decomposition.
- Sparse matrices / graphs:
- Partition rows or graph nodes to reduce edge cuts and communication.
Make the decomposition explicit in your design documents:
- Which process/worker owns which subset of data.
- What information must be exchanged between owners and when.
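Ownership can be specified precisely with a small partitioning helper. The sketch below uses one common convention, in which the first n mod p workers take one extra item; it is illustrative, not the only possible choice.

```python
def block_partition(n, p, rank):
    """Return the half-open index range [lo, hi) owned by `rank` when
    n items are split among p workers as evenly as possible (the first
    n % p workers each take one extra item)."""
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

# 10 items over 4 workers -> local sizes 3, 3, 2, 2
ranges = [block_partition(10, 4, r) for r in range(4)]
print(ranges)  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Writing ownership down as a pure function like this makes the design document executable: the same function can later drive allocation, halo metadata, and I/O offsets.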
Designing the Parallel Algorithm
With a decomposition, design the parallel workflow step-by-step.
High-Level Algorithm Structure
Sketch your algorithm as high-level pseudocode with clearly marked parallel regions. For example, for a time-stepping simulation:
- Initialize domain and data.
- Distribute data across processes/nodes.
- For each time step:
- Exchange boundary (halo) data with neighbors.
- Compute local updates.
- Optionally compute global diagnostics (e.g., norms, energy).
- Gather final data or write distributed output.
Write this in annotated pseudocode such as:

Initialize global parameters and problem size
Partition domain among P processes
For each process:
    Allocate local subdomain (with halo/ghost zones)
    Initialize local data
For t in 1..T:
    Exchange halo data with neighboring processes
    Compute local updates on interior points
    Update boundary points using received halo data
    If output step:
        Compute local diagnostics
        Perform global reductions for diagnostics
        Write output (local or parallel I/O)
Finalize and free resources

This helps you:
- Identify where communication occurs.
- See opportunities to overlap communication and computation.
- Decide which parts are performance-critical kernels.
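The time-stepping skeleton can be exercised serially before any MPI is written. In the sketch below (plain Python, illustrative values), a single subdomain is updated with an explicit 1D diffusion stencil; the halo arguments stand in for values that a real code would receive from neighboring processes during the exchange phase.

```python
def step_with_halo(local, left_halo, right_halo, alpha=0.1):
    """One explicit diffusion update on a 1D subdomain with one ghost
    cell per side. left_halo/right_halo mimic data received from the
    neighboring processes via a point-to-point halo exchange."""
    padded = [left_halo] + local + [right_halo]
    # padded[i] is the left neighbor of local[i]; padded[i+2] the right.
    return [local[i] + alpha * (padded[i] - 2 * local[i] + padded[i + 2])
            for i in range(len(local))]

u = [0.0] * 8
u[3] = u[4] = 1.0          # initial hot spot in the subdomain interior
for _ in range(10):
    u = step_with_halo(u, left_halo=0.0, right_halo=0.0)
print(round(sum(u), 6))
```

Validating the kernel this way, with hand-supplied halos, separates numerical bugs from communication bugs before the two are combined.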
Communication and Synchronization Design
Decide up front:
- Which data items need to be communicated.
- Who communicates with whom (e.g., nearest neighbors, all-to-all, root gathering).
- How often communication happens (once per step, occasionally, or frequently).
- Type of operations:
- Point-to-point exchanges.
- Collective operations (reductions, broadcasts, scatters/gathers).
- Synchronization barriers (only when necessary).
Plan to minimize:
- Global synchronization points.
- Large, frequent messages on the critical path.
- Contention for shared resources (e.g., file system, locks).
Algorithmic Choices for Scalability
For the same mathematical problem, different algorithms can have very different scalability. In your design, consider:
- Asymptotic cost: Favor algorithms with lower complexity for large $N$ even if constants are somewhat larger.
- Communication-avoiding algorithms where possible:
- Fewer global reductions.
- Aggregated communication instead of many small messages.
- Numerically stable algorithms that behave well in parallel (e.g., stable reductions, robust preconditioners).
Document why you choose a particular algorithm, including:
- Trade-off between simplicity and performance.
- Expected scaling behavior (e.g., which parts will limit strong scaling).
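The benefit of aggregating communication follows directly from a simple latency-bandwidth ("alpha-beta") cost model. The parameter values below are illustrative assumptions, not measurements of any particular network.

```python
def transfer_time(n_messages, bytes_per_message,
                  latency_s=1e-6, bandwidth_Bps=1e10):
    """Alpha-beta model: each message pays a fixed latency (alpha) plus
    its size divided by bandwidth (beta term). Illustrative defaults:
    1 microsecond latency, 10 GB/s bandwidth."""
    return n_messages * (latency_s + bytes_per_message / bandwidth_Bps)

total_bytes = 8 * 10**6  # e.g., one million doubles
many_small = transfer_time(10_000, total_bytes // 10_000)
one_large = transfer_time(1, total_bytes)
print(many_small / one_large)  # aggregation wins by a large factor
```

With these numbers the latency term dominates the small-message case, so one aggregated transfer is more than an order of magnitude cheaper than ten thousand small ones, even though the total volume is identical.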
Designing for the Target Architecture
An HPC application must be tailored to the actual hardware it will run on. During design, gather:
- Cores per node, memory per node, and cache characteristics.
- Presence and count of GPUs per node.
- Interconnect technology (e.g., Ethernet, InfiniBand).
- Available libraries and compilers.
Then design:
Node-Level Strategy
Within a node:
- How many processes vs threads should you use?
- How will you map threads to cores and, if relevant, to NUMA domains?
- Which loops or kernels will be threaded/vectorized?
- How will you lay out data in memory (AoS vs SoA) to enable vectorization and cache reuse?
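The AoS-versus-SoA choice can be sketched even in plain Python; the memory-contiguity argument properly applies to flat arrays (as in C, Fortran, or NumPy), so the comments state the intent rather than demonstrate actual strides.

```python
# Array of Structures: one record per particle. In a flat C array,
# a kernel touching only x must stride over y, z, and mass.
aos = [(float(i), 0.0, 0.0, 1.0) for i in range(4)]   # (x, y, z, mass)

# Structure of Arrays: one sequence per field. In a flat C array,
# an x-only kernel streams through adjacent elements, which
# vectorizes and reuses cache lines well.
soa = {"x": [float(i) for i in range(4)],
       "y": [0.0] * 4, "z": [0.0] * 4, "mass": [1.0] * 4}

# The same update expressed in both layouts:
aos = [(x + 1.0, y, z, m) for (x, y, z, m) in aos]
soa["x"] = [x + 1.0 for x in soa["x"]]

print([rec[0] for rec in aos])  # [1.0, 2.0, 3.0, 4.0]
print(soa["x"])                 # [1.0, 2.0, 3.0, 4.0]
```

The results are identical; only the memory traffic differs, which is why the layout decision belongs in the design phase rather than in late-stage optimization.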
Cluster-Level Strategy
Across nodes:
- How many nodes will be used.
- How processes will be mapped to nodes and sockets (e.g., 1 MPI rank per socket vs per core).
- How the domain decomposition maps onto the process grid.
- How communication patterns align with the network topology (where that matters).
Accelerator Strategy (If Used)
If targeting GPUs or other accelerators:
- Which parts of the computation will be offloaded.
- How data moves between host and device.
- How you will minimize data transfers (e.g., keep data resident on device across iterations).
- How to manage multiple GPUs per node.
Modularity and Code Organization
Poor structure kills maintainability and makes optimization harder. Design module boundaries before you write code:
Suggested separation of concerns:
- Core numerical kernels:
- Functions or modules that implement the math (e.g., matrix–vector product, flux computation).
- Designed to be agnostic of MPI or I/O where feasible.
- Parallel infrastructure:
- Initialization and finalization of MPI/OpenMP/GPU environment.
- Domain decomposition and neighbor metadata.
- Communication routines (e.g., halo exchange helpers, reduction wrappers).
- I/O and configuration:
- Reading input parameters and initial conditions.
- Writing checkpoints and output fields.
- Logging and diagnostic summaries.
- Driver / main program:
- Parses configuration, constructs objects, orchestrates major steps.
Benefits:
- You can replace kernels or communication implementations without rewriting the whole code.
- Testing becomes easier (e.g., test a kernel on a single core before integrating).
- Porting to another architecture (e.g., CPU-only to GPU-accelerated) affects fewer components.
For a course project, explicitly sketch:
- A list of source files or modules and their responsibilities.
- Public interfaces (APIs) for each module: functions, data types, and pre/post-conditions.
Using Libraries and Existing Components
A key design skill is deciding what not to write:
- Use established numerical libraries for:
- Linear algebra, FFTs, random numbers, etc.
- Use parallel I/O libraries or high-level interfaces when available.
- Reuse:
- Cluster-provided modules.
- Existing project templates or skeleton codes (if allowed by your course/project rules).
Design your code so that:
- Library calls are localized in dedicated modules.
- It’s easy to switch libraries (e.g., from a simple solver to a more advanced one) by changing a small part of the code.
Planning for Input, Output, and Checkpointing
I/O strategy is part of design, not an afterthought.
Input Strategy
- How configuration is specified:
- Command-line arguments.
- Simple configuration files (e.g., JSON, YAML, or plain text).
- How large input datasets are read:
- Replicated (small inputs read by all processes).
- Distributed (each process reads only its part).
- Root process reads then distributes.
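A minimal command-line interface for such a configuration might look like the following sketch; the option names (`--nx`, `--steps`, `--output-every`) are hypothetical, chosen only to illustrate the pattern.

```python
import argparse

def build_parser():
    """CLI sketch for a toy HPC driver; options are illustrative."""
    p = argparse.ArgumentParser(description="Toy HPC driver")
    p.add_argument("--nx", type=int, default=256,
                   help="grid points per side")
    p.add_argument("--steps", type=int, default=100,
                   help="number of time steps")
    p.add_argument("--output-every", type=int, default=10,
                   help="write diagnostics every N steps")
    return p

# Parsing an explicit argument list keeps the example self-contained.
args = build_parser().parse_args(["--nx", "512", "--steps", "50"])
print(args.nx, args.steps, args.output_every)  # 512 50 10
```

Keeping every run-defining parameter on the command line or in a configuration file, never hard-coded, is what later makes scaling studies scriptable.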
Output Strategy
Decide:
- What needs to be written (fields, diagnostics, logs).
- At what frequency (per step, every N steps).
- In which format (simple text vs binary vs standardized scientific formats).
For HPC, consider:
- Parallel output (each process writes its file) vs collective output.
- File size and file count (avoid millions of tiny files).
- Post-processing workflow: how users will visualize or analyze results.
Checkpointing
Design checkpointing into the application:
- What minimal state is needed to restart (fields, time step, parameters).
- How often to checkpoint, balancing:
- Runtime overhead.
- Risk of losing work in a failure.
- Format:
- Simple, versioned formats that you can evolve later.
- Compatibility between versions where possible.
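A minimal versioned checkpoint can be sketched as follows; JSON is used here purely for illustration, whereas a real application would likely use a binary or parallel format.

```python
import json, os, tempfile

CHECKPOINT_VERSION = 1  # bump whenever the on-disk format changes

def write_checkpoint(path, step, field):
    """Write the minimal restart state with an explicit format version."""
    state = {"version": CHECKPOINT_VERSION, "step": step,
             "field": list(field)}
    with open(path, "w") as f:
        json.dump(state, f)

def read_checkpoint(path):
    """Read a checkpoint, rejecting versions this code cannot restore."""
    with open(path) as f:
        state = json.load(f)
    if state["version"] != CHECKPOINT_VERSION:
        raise ValueError(f"unsupported checkpoint version {state['version']}")
    return state["step"], state["field"]

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
write_checkpoint(path, step=42, field=[0.0, 1.0, 0.5])
step, field = read_checkpoint(path)
print(step, field)  # 42 [0.0, 1.0, 0.5]
```

The explicit version field is the key design point: it lets a later code revision detect old checkpoints and either convert them or fail with a clear message instead of silently misreading state.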
Designing a Testing and Validation Strategy
Before writing code, define how you will know it is correct.
Levels of Testing
Design for:
- Unit tests for key kernels:
- Small test problems with known analytical or reference solutions.
- Integration tests:
- Small-scale runs that exercise communication, I/O, and control flow.
- Regression tests:
- Fixed inputs whose outputs are stored and checked for changes as the code evolves.
Validation Cases
Choose validation problems that:
- Have exact or high-accuracy reference solutions.
- Are small enough to run quickly on limited resources.
- Exercise important physics or mathematics relevant to your application.
Write down:
- Test inputs.
- Expected outputs or error thresholds.
- How you will automate checks (scripts, CI, or manual procedures for the project).
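A unit test of a kernel against an analytical reference might look like this sketch, using a composite trapezoid rule as a stand-in kernel; the tolerance is chosen from the method's O(h^2) error bound rather than guessed.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoid rule; stands in for a 'key kernel' under test."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

def test_trapezoid_against_analytic():
    # The integral of sin over [0, pi] is exactly 2. With n = 1000 the
    # O(h^2) error is roughly pi * (pi/1000)^2 / 12 ~ 3e-6, so 1e-5 is
    # a safe but meaningful threshold.
    approx = trapezoid(math.sin, 0.0, math.pi, n=1000)
    assert abs(approx - 2.0) < 1e-5

test_trapezoid_against_analytic()
print("ok")
```

Deriving the threshold from the method's error estimate, rather than loosening it until the test passes, is what makes such a test a genuine correctness check.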
Performance-Aware Design
Even before detailed optimization, design for reasonable performance:
Performance Model Sketch
Estimate:
- Total floating-point operations for key kernels.
- Data volume moved between:
- Memory levels.
- Processes (communication).
- Disk (I/O).
Use these rough estimates to anticipate:
- Whether you will be compute-bound, memory-bound, or communication-bound.
- Which parts of the code will need the most optimization attention.
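A back-of-the-envelope classification compares a kernel's arithmetic intensity (FLOPs per byte moved) against an assumed machine balance. The daxpy flop and byte counts below follow from its definition; the machine balance figure is an illustrative assumption, not a measured value.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of memory traffic; compared with the machine
    balance to predict whether a kernel is compute- or memory-bound."""
    return flops / bytes_moved

# Example: daxpy y = a*x + y over N doubles.
# Per element: 2 flops (multiply, add); traffic: read x, read y,
# write y = 3 doubles = 24 bytes.
N = 10**6
ai = arithmetic_intensity(2 * N, 3 * N * 8)

machine_balance = 10.0  # flops per byte a CPU might sustain (assumed)
verdict = "memory-bound" if ai < machine_balance else "compute-bound"
print(ai, verdict)
```

At roughly 0.08 flops/byte, daxpy sits far below any realistic machine balance, so the design conclusion is that optimizing its arithmetic is pointless; memory traffic is what must be reduced.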
Minimizing Overheads by Design
In your design, aim to:
- Reduce unnecessary data copies.
- Batch small operations into larger ones.
- Organize data in contiguous arrays where possible.
- Plan to avoid frequent allocation/deallocation in inner loops.
Document the critical paths in your algorithm (sections that will dominate runtime) and design those paths with extra care.
Planning for Scalability Experiments
Your project will likely require demonstration of scaling behavior. Design the application so that scaling studies are straightforward:
- Parameterize problem size (not hard-coded) so you can:
- Increase grid resolution or number of particles.
- Increase number of independent tasks.
- Parameterize parallelism:
- Number of processes, threads, and GPUs.
- Provide configurable I/O frequency to isolate:
- Compute + communication from I/O effects.
Plan which experiments you will run:
- Strong scaling: fixed problem size, increasing resources.
- Weak scaling: increasing problem size proportionally to resources.
This impacts design; for instance, the domain decomposition should gracefully handle growing process counts.
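The limit on strong scaling can be previewed with Amdahl's law before any experiment is run; the 5% serial fraction below is an illustrative assumption about a hypothetical code.

```python
def amdahl_speedup(serial_fraction, p):
    """Ideal strong-scaling speedup on p workers when a fixed
    `serial_fraction` of the runtime cannot be parallelized
    (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# Even 5% serial work caps strong scaling well below the worker count:
for p in (1, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))
```

The asymptote is 1/serial_fraction (here, a speedup of at most 20 no matter how many workers are added), which is why identifying and shrinking serial sections belongs in the design phase of any strong-scaling study.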
Documentation and User Interface Design
An HPC application often has multiple users (including your future self). Design with usability in mind:
User-Facing Interface
Decide how users will:
- Run the code (command-line interface, configuration files).
- Select problem size, time steps, solver options, output settings.
- Understand what the code is doing (logging, progress reports).
Internal Documentation
Before coding, outline:
- A short README explaining:
- Purpose and main features.
- How to build and run basic examples.
- Developer notes:
- Code structure and data layout.
- Parallel decomposition and communication patterns.
- Known limitations and assumptions.
Design choices should be traceable from the documentation to the implementation.
A Practical Design Workflow for the Course Project
For your final project, a concrete step-by-step design process could be:
- Problem definition:
- Write a one-page description of the scientific/engineering task, inputs/outputs, and goals.
- Parallelization plan:
- Identify core kernels and whether they are task- or data-parallel.
- Choose programming model(s) (MPI, OpenMP, GPU, or hybrid).
- Data and domain decomposition:
- Sketch how data is partitioned and which process/thread owns what.
- Draw diagrams for domain decomposition if spatial.
- Algorithm and communication design:
- Write pseudocode with clearly marked communication phases and parallel loops.
- Specify which operations are collective and where synchronizations occur.
- Architecture mapping:
- Decide process/thread/GPU counts per node for typical runs.
- Plan memory usage (estimate per-process memory needs).
- Module structure and interfaces:
- Define source files/modules and their responsibilities.
- List public APIs and data structures.
- I/O and checkpointing plan:
- Decide formats, frequencies, and strategies for input, output, and restart.
- Testing and validation plan:
- Choose test cases and validation metrics.
- Decide how you will automate or repeatedly run them.
- Performance and scaling plan:
- Identify performance-critical regions.
- Define a small set of scaling experiments you will perform later.
- Documentation plan:
- Outline README, usage examples, and developer notes.
By following this structured design process, you create an HPC application that is not only parallel and fast, but also understandable, verifiable, and extensible—qualities that are crucial in real-world HPC projects.