
16.4 Testing strategies for parallel codes

Introduction

Testing parallel programs is fundamentally different from testing sequential ones. Parallel codes have many possible execution orders, depend on subtle timing interactions, and often run on resources that you cannot fully reproduce on your laptop. This section focuses on strategies to design tests that are useful, reliable, and repeatable for shared memory, distributed memory, and hybrid parallel programs.

The goal is not to replace debugging, which is covered elsewhere, but to ensure that bugs are caught as early as possible, long before you run at scale on an expensive production system.

What makes testing parallel codes hard

Parallel tests must cope with nondeterminism. The same input can produce:

Different message interleavings between MPI processes.

Different scheduling of OpenMP threads.

Different ordering of floating point operations, which can slightly change numerical results.

A second difficulty is scale. Many bugs only appear:

When you use many processes or threads.

When the data size is large enough to stress memory, communication, or I/O.

You cannot simply run all tests at “full scale” every time, so you need a layered test strategy that balances speed and coverage.

Finally, performance regressions are much easier to introduce in parallel codes. A change that keeps the answer correct but doubles communication volume is still a serious bug in an HPC context.

Designing a parallel testing strategy

A good testing strategy for parallel codes combines several dimensions:

Varying scale, from serial to full parallel.

Varying process and thread counts.

Different decompositions and layouts.

Different system environments when possible.

Think of your test suite as a pyramid. At the bottom you have many fast, small tests that run on few cores and focus on correctness. Toward the top, you have fewer, more expensive tests that run at larger scale and also check performance and scalability.

A robust parallel testing strategy always includes:

  1. Serial or single-rank / single-thread tests for basic logic.
  2. Small parallel tests with different process and thread counts.
  3. Tests that vary domain decompositions and communication patterns.
  4. At least some medium or large scale tests that run regularly.

Using serial tests as a baseline

Even highly parallel codes usually contain serial building blocks, such as local kernels, utility functions, and I/O routines. Testing these in a serial context simplifies failure analysis and gives you a strong baseline.

For example, in an MPI program, you can:

Run the code with np = 1 as a serial test whenever that is meaningful.

Wrap communication parts so that, in a special “serial test” mode, they are bypassed or replaced by local copies.

In OpenMP programs, you can:

Disable OpenMP with OMP_NUM_THREADS=1 to check that the code works like a serial program.

Compile without OpenMP support for certain test configurations.

The serial tests verify mathematical correctness, boundary conditions, error handling, and basic I/O. Once these are trusted, you can more easily attribute later failures to parallel interactions.
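
As a minimal sketch of such a baseline, the following serial test checks a small local kernel against hand-computed values. The kernel local_stencil and its behavior are purely illustrative; in practice you would test your own building blocks in the same way.

```python
import numpy as np

def local_stencil(u):
    """Hypothetical local kernel: 1D three-point average, boundaries copied."""
    v = u.copy()
    v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
    return v

def test_local_stencil_serial():
    # Tiny input where the expected result can be computed by hand.
    u = np.array([0.0, 3.0, 6.0, 9.0, 0.0])
    expected = np.array([0.0, 3.0, 6.0, 5.0, 0.0])
    assert np.allclose(local_stencil(u), expected, rtol=1e-12, atol=1e-12)

if __name__ == "__main__":
    test_local_stencil_serial()
    print("serial kernel test passed")
```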

Small-scale parallel unit tests

After serial tests pass, you should write unit tests that actually use multiple threads or processes. The challenge is that traditional unit test frameworks are usually not parallel-aware, so you must decide:

How the test is orchestrated across ranks or threads.

Who checks correctness and reports failures.

In MPI codes, common strategies are:

Let only rank 0 perform assertions on globally gathered results, while other ranks participate in computation and communication.

Have each rank perform local checks on its data, then use MPI_Allreduce with logical AND / OR to combine success or failure, and finally let rank 0 report (see the sketch at the end of this subsection).

In shared memory codes:

Use a single “testing thread” for assertions, and ensure other threads reach a barrier before and after the checks.

Avoid printing from many threads or ranks in tests, since this makes test output confusing and can hide real failures.

Tests at this level typically run with 2, 4, or 8 ranks or threads, and focus on small data sizes where you can predict exact results.
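
The second MPI pattern above, local checks combined into a global logical AND, can be sketched as follows with mpi4py; the data layout and the checked quantity are illustrative.

```python
from mpi4py import MPI
import numpy as np

def test_block_sum():
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each rank owns a small, predictable block of data.
    local = np.arange(rank * 4, rank * 4 + 4, dtype=np.float64)

    # Local check: the block sum has a closed-form value.
    expected_local = 16.0 * rank + 6.0   # sum of rank*4 .. rank*4 + 3
    local_ok = bool(np.isclose(local.sum(), expected_local))

    # Combine per-rank results with a logical AND across all ranks.
    all_ok = comm.allreduce(local_ok, op=MPI.LAND)

    # Only rank 0 reports, so the test output stays readable.
    if rank == 0:
        if not all_ok:
            raise AssertionError("test_block_sum failed on at least one rank")
        print(f"test_block_sum passed on {size} ranks")

if __name__ == "__main__":
    test_block_sum()
```

Run it with, for example, mpirun -np 4 python test_block_sum.py; only rank 0 prints or raises.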

Testing nondeterministic behavior

Nondeterminism in parallel codes comes from scheduling and communication ordering. Some bugs, for example race conditions or missing synchronization, may not appear in every run. To increase your chances of detecting them, you can:

Run the same test multiple times, possibly changing environment variables or random seeds.

Change the number of processes or threads.

Introduce small artificial delays in selected code paths.

In shared memory tests, you can:

Use environment variables that influence thread scheduling, such as different OpenMP schedule kinds (static, dynamic, guided) or chunk sizes, to vary the execution order.

In MPI tests, you can:

Run with different process counts that do not divide the data size evenly, which changes the pattern of message sizes and communication partners.

Use tools or wrappers that add random delays around communication calls, when available.

It is important to control randomness. If your code uses random input, make tests repeatable by setting explicit seeds and documenting them. For stress tests that intentionally vary behavior, you can allow some controlled randomness but log the seed used so that you can reproduce a failing run.
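
A small stress driver along these lines might look as follows. It assumes a hypothetical test binary ./stress_test that reads its seed from an environment variable TEST_SEED; both names are placeholders.

```python
import os
import random
import subprocess

def run_stress(repeats=5, thread_counts=(1, 2, 4, 8), exe="./stress_test"):
    """Re-run a hypothetical test binary under varying schedules and seeds."""
    for threads in thread_counts:
        for i in range(repeats):
            seed = random.randrange(2**31)
            env = dict(os.environ,
                       OMP_NUM_THREADS=str(threads),
                       OMP_SCHEDULE="dynamic,1",  # affects loops using schedule(runtime)
                       TEST_SEED=str(seed))       # hypothetical seed variable
            # Log the seed so a failing run can be reproduced exactly.
            print(f"run {i} with {threads} threads, seed {seed}")
            result = subprocess.run([exe], env=env)
            if result.returncode != 0:
                raise RuntimeError(f"stress test failed: threads={threads}, seed={seed}")

if __name__ == "__main__":
    run_stress()
```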

Numerical correctness and tolerance-based checks

Parallel execution often changes the order of floating point operations. Floating point addition is not associative, so

$$ (a + b) + c \neq a + (b + c) $$

in general. Even if every operation is correct, the final result may differ slightly across process counts or thread configurations. Test assertions must be designed with this in mind.

Instead of comparing floating point results bit by bit, use tolerances on the absolute or relative error. A standard pattern is:

$$ \text{error} = \frac{\lVert x_{\text{test}} - x_{\text{ref}} \rVert}{\lVert x_{\text{ref}} \rVert} $$

and require that error is less than some tolerance tol.

For floating point results in parallel tests, always:

  1. Compare with a reference solution using norms.
  2. Use absolute and/or relative tolerances.
  3. Make the tolerances strict enough to catch real errors, but not so strict that harmless rounding differences cause failures.
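
A minimal sketch of such a norm-based check, using the relative-error formula above with an absolute-error fallback for near-zero references:

```python
import numpy as np

def check_close(x_test, x_ref, rel_tol=1e-10, abs_tol=1e-14):
    """Norm-based comparison: pass if the relative or the absolute tolerance holds."""
    x_test = np.asarray(x_test, dtype=float)
    x_ref = np.asarray(x_ref, dtype=float)
    abs_err = np.linalg.norm(x_test - x_ref)
    ref_norm = np.linalg.norm(x_ref)
    rel_err = abs_err / ref_norm if ref_norm > 0.0 else np.inf
    assert rel_err <= rel_tol or abs_err <= abs_tol, (
        f"relative error {rel_err:.3e} and absolute error {abs_err:.3e} "
        f"both exceed their tolerances")
    return rel_err
```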

Reference solutions for parallel tests can come from:

A trusted serial implementation with high precision.

A smaller problem that you can solve analytically.

A known-good version of the code whose output you store as “golden” data, with careful documentation.

When results vary slightly between process counts, you must decide whether this is acceptable. Ideally, you express your physics or mathematics in terms of invariants that should hold independent of parallel layout, such as conservation of mass, energy, or probability.
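
For example, a layout-independent conservation check could be sketched like this with mpi4py, assuming each rank holds part of a conserved field and a trusted reference total is available:

```python
from mpi4py import MPI
import numpy as np

def check_conservation(local_field, total_ref, rel_tol=1e-12):
    """Check that the globally summed field matches a reference total.

    The check does not depend on how the field is split across ranks.
    """
    comm = MPI.COMM_WORLD
    local_sum = float(np.sum(local_field))
    global_sum = comm.allreduce(local_sum, op=MPI.SUM)
    rel_err = abs(global_sum - total_ref) / abs(total_ref)
    assert rel_err < rel_tol, f"conservation violated: relative error {rel_err:.3e}"
    return global_sum
```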

Structured test problem design

Good test cases for parallel programs are not arbitrary. They are chosen to:

Exercise edge cases in domain decomposition.

Trigger different communication paths.

Expose synchronization and load balancing issues.

For domain-decomposed codes, you can choose small grids or meshes that:

Do not divide evenly across ranks, for example grid size 10 with 3 ranks.

Have boundaries that require ghost cells or halo exchanges.

Include corner and edge cases like periodic boundaries or mixed boundary conditions.
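
The following sketch tests one such edge case: a contiguous 1D block decomposition must cover a grid of size 10 exactly once when split over 3 ranks, and must also tolerate ranks that receive no cells. The decomposition function here is illustrative, not your code's actual one.

```python
def block_range(n, size, rank):
    """Contiguous 1D block decomposition that handles non-divisible sizes."""
    base, rem = divmod(n, size)
    lo = rank * base + min(rank, rem)
    hi = lo + base + (1 if rank < rem else 0)
    return lo, hi

def test_decomposition_covers_grid(n, size):
    covered = []
    for rank in range(size):
        lo, hi = block_range(n, size, rank)
        covered.extend(range(lo, hi))
    # Every cell must be owned by exactly one rank, even when n % size != 0,
    # and ranks may legitimately own zero cells when size > n.
    assert sorted(covered) == list(range(n))

if __name__ == "__main__":
    test_decomposition_covers_grid(10, 3)   # uneven split: blocks of 4, 3, 3
    test_decomposition_covers_grid(4, 8)    # more ranks than cells
    print("decomposition tests passed")
```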

For parallel I/O, you can design tests where:

Each rank writes a different size of data.

Some ranks write no data at all.

The global file layout is intentionally irregular.

You should also test degenerate cases, such as:

Running with more ranks than data blocks, so that some ranks receive no data.

Running with the minimum and maximum supported number of threads.

These tests are more likely to uncover assumptions in your code that only hold for “nice”, symmetric configurations.

Regression tests for concurrency bugs

Once you discover and fix a parallel bug, especially a concurrency bug such as a race condition or deadlock, you must add a regression test that is designed to fail if the bug ever returns.

For example, if you fixed a race in a shared array:

Create a small test program that stresses the array update code with many threads and iterations.

Run this test multiple times or with varying thread counts.

Make it part of your regular test suite.
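
A driver for such a regression test might look like the sketch below, assuming the fix is exercised by a small hypothetical reproducer binary ./race_reproducer that exits with a nonzero status if the race corrupts the result.

```python
import os
import subprocess

def test_shared_array_race(exe="./race_reproducer", repeats=20):
    """Re-run a small reproducer many times; a reintroduced race should make
    at least one run exit with a nonzero status."""
    for threads in (2, 4, 8, 16):
        for _ in range(repeats):
            env = dict(os.environ, OMP_NUM_THREADS=str(threads))
            result = subprocess.run([exe], env=env)
            assert result.returncode == 0, (
                f"race regression detected with {threads} threads")

if __name__ == "__main__":
    test_shared_array_race()
    print("race regression test passed")
```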

For a fixed deadlock in an MPI collective communication:

Write a test that exercises the specific communication sequence that used to deadlock.

Use timeouts in your test environment so that a deadlock causes a test failure rather than an infinite hang.
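
With a subprocess-based test harness, such a timeout can be implemented directly; the launcher command, reproducer name, and timeout value below are assumptions, and the timeout should be generous compared to the normal runtime.

```python
import subprocess

def test_collective_no_deadlock(timeout_s=60):
    """Run the communication sequence that used to deadlock, under a timeout."""
    cmd = ["mpirun", "-np", "4", "./collective_reproducer"]  # hypothetical reproducer
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        raise AssertionError(
            f"collective test did not finish within {timeout_s}s: possible deadlock")
    assert result.returncode == 0

if __name__ == "__main__":
    test_collective_no_deadlock()
    print("deadlock regression test passed")
```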

Regression tests for parallel bugs often need some extra instrumentation or configuration options. It is useful to keep small “bug reproducer” programs isolated and simple, so that future developers understand what is being tested.

Testing scalability and performance regressions

In HPC, output correctness is only half of the story. Codes that slow down significantly due to a minor change can be as problematic as codes that crash. Therefore, you should complement functional tests with performance tests.

Performance tests compare metrics such as:

Total runtime.

Time spent in major phases (computation, communication, I/O).

Speedup and efficiency as you vary process counts or thread counts.
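
With T(p) the runtime on p processes, the usual definitions are

$$ S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} $$

and a scaling test can require that the efficiency E(p) stays above a chosen threshold for the configurations you track.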

You can define acceptable performance envelopes. For instance, a test might require that the runtime on 32 ranks is no more than 20% slower than that of a known-good version. These tests are usually more expensive and may not run on every commit, but they should run regularly, such as nightly or weekly.
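
Such an envelope check can be sketched as follows; the baseline file name, its JSON format, and the 20% margin are assumptions.

```python
import json

def check_runtime_envelope(measured_s, baseline_file="baseline_32ranks.json",
                           allowed_slowdown=0.20):
    """Fail if the measured runtime exceeds the stored baseline by more than 20%."""
    with open(baseline_file) as f:
        baseline_s = json.load(f)["runtime_seconds"]
    limit = baseline_s * (1.0 + allowed_slowdown)
    if measured_s > limit:
        raise AssertionError(
            f"performance regression: {measured_s:.2f}s > {limit:.2f}s "
            f"(baseline {baseline_s:.2f}s + {allowed_slowdown:.0%})")

# Example: check_runtime_envelope(measured_s=123.4)
```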

To test scalability you can measure:

Strong scaling, keeping problem size fixed and increasing resources.

Weak scaling, increasing problem size proportionally to resources.

For testing purposes you do not need production-scale sizes, but you should still use problem sizes large enough that communication and computation patterns are realistic. Tiny test problems often fit entirely in cache and may hide scalability issues.

Automation and continuous integration for parallel tests

Continuous integration (CI) is more challenging in HPC because CI servers may not provide MPI, GPUs, or large core counts. Still, you can design your test suite so that a meaningful subset runs automatically.

Useful approaches include:

A “minimal CI” configuration that runs serial and small-scale parallel tests, for example MPI with 2 or 4 ranks and a few OpenMP threads.

Optional pipelines that run on special runners connected to an HPC cluster, scheduled less frequently, but with larger scale and performance tests.

Configuration files or scripts that make it easy to run the full test suite on the target cluster. For example, you can provide test drivers that submit test jobs via the scheduler.

Integration of existing testing frameworks with MPI is often done by:

Wrapping the test executable in mpirun or srun inside the CI job configuration.

Using environment variables to control the number of ranks and threads.
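
A simple driver along these lines is sketched below. The environment variable names, the launcher selection, and the test binary are assumptions; the point is that the same script can serve both a minimal CI runner and the cluster.

```python
import os
import subprocess
import sys

def run_mpi_test(exe, default_ranks=2, default_threads=1):
    """Launch one parallel test, configured through environment variables."""
    ranks = os.environ.get("TEST_NUM_RANKS", str(default_ranks))
    threads = os.environ.get("TEST_NUM_THREADS", str(default_threads))
    launcher = os.environ.get("TEST_LAUNCHER", "mpirun")  # e.g. "srun" on the cluster
    env = dict(os.environ, OMP_NUM_THREADS=threads)
    # mpirun takes -np, srun takes -n; adjust for your site's launcher.
    flag = "-np" if launcher == "mpirun" else "-n"
    return subprocess.run([launcher, flag, ranks, exe], env=env).returncode

if __name__ == "__main__":
    sys.exit(run_mpi_test("./test_mpi_small"))  # hypothetical small MPI test binary
```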

A practical pattern is to define multiple test targets, such as:

test-serial for pure serial tests.

test-mpi-small for small MPI tests.

test-hybrid-small for MPI plus OpenMP.

test-perf for performance and scaling tests.

Then you select which targets to run in each automated environment.

Deterministic test configurations

Although some nondeterminism is unavoidable, testing benefits from configurations that are as deterministic as possible. This makes failures reproducible, which is essential for debugging and verifying fixes.

To increase determinism you can:

Pin threads to cores using environment variables or runtime options, so that operating system scheduling is more consistent.

Use fixed random seeds for any randomized algorithms.

Avoid test designs that depend heavily on exact timing, which can vary between runs and machines.
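
A sketch of a helper that prepares such a deterministic environment, using standard OpenMP pinning variables and a fixed seed (the seed variable name is a placeholder):

```python
import os

def deterministic_env(seed=12345, threads=4):
    """Environment for reproducible test runs: pinned threads, fixed seed."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(threads)
    env["OMP_PROC_BIND"] = "close"   # bind threads close to the primary thread
    env["OMP_PLACES"] = "cores"      # one place per core
    env["TEST_SEED"] = str(seed)     # hypothetical seed read by the code under test
    return env
```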

If your code has optional features that introduce additional nondeterminism, such as asynchronous progress or nonblocking I/O, define specific tests that enable or disable them. This separation helps you understand which component is responsible when a test fails.

Testing hybrid and heterogeneous codes

Hybrid codes that use both MPI and threads, or that offload work to GPUs or other accelerators, need additional care in testing.

For hybrid MPI plus OpenMP or MPI plus another threading model:

Combine the earlier strategies: vary both rank counts and thread counts, and include combinations where the counts do not divide the problem size evenly.

Test configurations where each node runs multiple ranks with multiple threads, and configurations where each node runs a single rank with many threads. These layouts stress different parts of the memory and communication hierarchy.
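
A small configuration matrix for such hybrid tests might be sketched as follows; the rank and thread combinations and the test binary name are illustrative.

```python
import os
import subprocess

# Hybrid test matrix: (MPI ranks, OpenMP threads per rank).
# Covers both "many ranks, few threads" and "few ranks, many threads" layouts.
HYBRID_CONFIGS = [(8, 1), (4, 2), (2, 4), (1, 8), (3, 2)]  # 3 ranks: uneven split

def run_hybrid_matrix(exe="./test_hybrid_small"):
    for ranks, threads in HYBRID_CONFIGS:
        env = dict(os.environ, OMP_NUM_THREADS=str(threads))
        result = subprocess.run(["mpirun", "-np", str(ranks), exe], env=env)
        assert result.returncode == 0, f"hybrid test failed with {ranks}x{threads}"

if __name__ == "__main__":
    run_hybrid_matrix()
    print("hybrid test matrix passed")
```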

For GPU or accelerator codes:

Include tests that run with GPUs disabled or in a “CPU fallback” mode if such a mode exists. This provides a simpler correctness baseline.

When testing with accelerators enabled, use small data sizes for functional tests, and reserve larger data sizes for performance tests.

Be aware that floating point behavior and performance characteristics can differ between CPU and GPU, so tune your tolerances and performance expectations accordingly.

Organizing and documenting parallel tests

A good parallel test suite is not just a collection of executables. It is structured and documented so that others can understand and maintain it. Helpful practices include:

Grouping tests by purpose, such as correctness, regression, scalability, and performance, rather than only by source directory.

Documenting for each test:

Which parallel configuration it assumes, such as numbers of ranks and threads.

The expected runtime and resource requirements.

The reference output or invariants used to check correctness.

Any known limitations, such as dependence on a specific MPI implementation or filesystem.

Separating small “developer tests” that run quickly from larger “integration tests” that require the actual cluster environment. Developers can run the small tests frequently, while the full suite runs on shared infrastructure.

Documenting how to run the test suite on the target HPC system, including scheduler commands, modules, and environment setup, greatly lowers the barrier for new contributors.

Summary

Testing parallel codes requires you to think beyond simple input and output comparisons. You must handle nondeterminism, floating point differences, and concurrency bugs that may only show up under specific parallel configurations. A layered strategy that starts with serial tests and builds up to small, medium, and large scale parallel tests can provide both quick feedback and broad coverage.

By using tolerance-based numerical checks, structured test problems, regression tests for concurrency bugs, and automated integration on at least a subset of your target environment, you create a test suite that protects both correctness and performance as your parallel code evolves.
