Goals and challenges of testing parallel codes
Testing parallel programs is not just “run the same tests, but faster.” Concurrency introduces new failure modes:
- Non-determinism: bugs may appear only for some interleavings, input sizes, process counts, or hardware.
- Heisenbugs: adding debug output or running under a debugger can make bugs disappear by changing timing.
- Scale dependence: a test that passes on your laptop with 4 threads may fail on the cluster with 1024 ranks.
- Environment dependence: different compilers, MPI libraries, CPU models, and interconnects can change behavior.
Effective strategies for parallel testing try to:
- Control or explore different schedules (interleavings, process/thread mappings).
- Exercise tests at multiple scales (small, medium, large), not just “production scale.”
- Make failures visible and diagnosable (good assertions, logging, checksums).
- Keep tests as fast and automatable as possible.
The following sections focus on practical testing strategies that are specific to parallel programs in HPC.
Unit testing in parallel contexts
Unit testing is still the foundation, but parallel code changes how you design units and what you verify.
Isolating units from parallel runtimes
Where possible, separate pure computation from parallel orchestration:
- Put math kernels, small-data algorithms, and other pure computation into serial, independently testable functions.
- Keep MPI or OpenMP usage in thin wrapper layers around these functions.
Then:
- Test the pure parts with standard unit test frameworks (e.g., pytest, GoogleTest, Catch2).
- Use mock objects or simple “fake” implementations instead of full MPI calls or thread pools where appropriate.
This reduces the amount of code that must be tested under complex parallel conditions.
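As a hedged illustration, here is a minimal Python sketch of that separation, assuming mpi4py as the MPI layer; local_mean_contribution and global_mean are hypothetical names. The unit test touches only the pure kernel and needs no MPI runtime.

```python
import numpy as np

def local_mean_contribution(values):
    """Pure, serial computation: partial sum and count for a mean."""
    return float(np.sum(values)), int(values.size)

def global_mean(values, comm):
    """Thin parallel wrapper (assumes an mpi4py communicator)."""
    local_sum, local_n = local_mean_contribution(values)
    total_sum = comm.allreduce(local_sum)  # mpi4py allreduce defaults to SUM
    total_n = comm.allreduce(local_n)
    return total_sum / total_n

def test_local_mean_contribution():
    """Plain pytest-style unit test of the pure part; no MPI needed."""
    s, n = local_mean_contribution(np.array([1.0, 2.0, 3.0]))
    assert n == 3
    assert abs(s - 6.0) < 1e-12
```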
Unit testing thread-safe components
Where concurrency cannot be factored out (locks, queues, work-stealing schedulers):
- Write focused tests that:
- Create multiple threads.
- Exercise concurrent operations (e.g., many threads push/pop from a queue).
- Use stress-style unit tests:
- Loop operations many times.
- Randomize access patterns to increase the chance of exposing race conditions.
To keep CI runs fast, you can provide:
- A “light” mode (few threads, few iterations).
- A “stress” mode (more threads, more iterations) run less frequently (e.g., nightly).
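A minimal Python sketch of such a stress-style test, using the standard library's thread-safe queue; the STRESS_TESTS environment variable that selects the heavier mode is a hypothetical convention.

```python
import os
import queue
import threading

def test_concurrent_queue():
    stress = os.environ.get("STRESS_TESTS", "0") == "1"  # hypothetical switch
    n_threads = 16 if stress else 4
    n_items = 10_000 if stress else 500

    q = queue.Queue()
    popped = []
    lock = threading.Lock()

    def producer():
        for i in range(n_items):
            q.put(i)

    def consumer():
        for _ in range(n_items):
            item = q.get()
            with lock:
                popped.append(item)

    threads = ([threading.Thread(target=producer) for _ in range(n_threads)]
               + [threading.Thread(target=consumer) for _ in range(n_threads)])
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Every produced item must be consumed exactly once.
    assert len(popped) == n_threads * n_items
```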
Regression and integration testing at different scales
Parallel HPC codes often have large, complex integration surfaces: MPI, filesystems, accelerators, and external libraries. Effective regression testing must consider:
- Configuration diversity (thread counts, ranks, nodes).
- Problem sizes (tiny, realistic, large).
- Hardware (CPU-only vs GPU-enabled, different clusters).
Multi-configuration test matrix
Design an automated test matrix that varies:
- Number of threads (e.g., OMP_NUM_THREADS=1,2,8).
- MPI process counts (e.g., -n 1,2,4,8,16).
- Hybrid configurations (MPI ranks × threads per rank).
You do not need exhaustive combinations; choose representative configurations:
- Minimum scale: 1 rank / 1 thread (helps distinguish algorithmic bugs from parallel bugs).
- Small parallel setups: e.g., 2–8 ranks or threads (good for CI).
- Medium setups: e.g., 32–128 ranks or threads (for scheduled nightly/weekly tests).
Document which configurations are critical for your application (e.g., “2D decomposition requires at least 4 ranks”).
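One way to automate such a matrix is a small driver script. The sketch below assumes an mpirun launcher and a hypothetical ./run_tests executable, and uses OMP_NUM_THREADS to set the thread count per rank.

```python
import os
import subprocess

CONFIGS = [
    (1, 1),   # minimum scale: isolates algorithmic bugs from parallel bugs
    (2, 1),   # smallest MPI-parallel case
    (4, 2),   # small hybrid case suitable for CI
]

for ranks, threads in CONFIGS:
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    result = subprocess.run(["mpirun", "-n", str(ranks), "./run_tests"], env=env)
    print(f"ranks={ranks} threads={threads} exit={result.returncode}")
    assert result.returncode == 0, f"failure at ranks={ranks}, threads={threads}"
```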
Scaled problem sizes
Use different problem sizes for different testing goals:
- Tiny:
- Run in seconds.
- Cover code paths and control logic.
- Used in CI and for quick local testing.
- Small/medium:
- More realistic workload.
- Enough data to surface issues such as buffer overflows, halo-exchange bugs, and cache effects.
- Large:
- Used less frequently (e.g., before releases).
- Can reveal performance-related bugs (timeouts, hangs, memory pressure).
For regression tests, you typically fix the input size and configuration to maximize reproducibility.
Deterministic vs non-deterministic tests
Parallel code naturally introduces non-determinism (e.g., the order in which sums, reductions, or tasks complete). For testing, two complementary approaches help: enforce determinism where you can, and make checks robust to benign variation where you cannot.
Enforcing determinism where possible
Try to make test runs deterministic by design:
- Use fixed seeds for random number generators.
- Use deterministic algorithms or deterministic reduction orders for tests.
- Avoid timing-dependent logic in tests (like “if job finishes in < X seconds then …”).
If the production algorithm is inherently non-deterministic, consider enabling an optional deterministic testing mode (e.g., controlled scheduling, deterministic reductions) that is only used during tests.
Designing robust numerical checks
Floating-point results may differ slightly between configurations even for correct behavior. Testing strategies:
- Check value ranges or relative/absolute error:
- Use tolerances: e.g., |x_test - x_ref| <= atol + rtol * |x_ref|.
- Use checksums or norms:
- Compare $L_1$, $L_2$, or $L_\infty$ norms of arrays instead of every element.
- Prefer invariants:
- Conservation laws (mass, energy), monotonicity, symmetry.
Keep tolerances tight enough to catch real regressions but not so tight that harmless rounding differences fail the test.
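A minimal NumPy sketch of these checks; the tolerances shown are placeholders that you would tune per quantity.

```python
import numpy as np

def check_against_reference(x_test, x_ref, rtol=1e-10, atol=1e-12):
    # Element-wise check: |x_test - x_ref| <= atol + rtol * |x_ref|
    assert np.allclose(x_test, x_ref, rtol=rtol, atol=atol)

    # Norm-based check: compare aggregate quantities instead of every element.
    rel_l2_error = np.linalg.norm(x_test - x_ref) / np.linalg.norm(x_ref)
    assert rel_l2_error < 1e-10
```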
Testing for race conditions and deadlocks
Some issues in parallel codes are schedule-sensitive and may not show up in basic runs.
Schedule diversity
To expose more interleavings:
- Change the number of threads or process counts in tests.
- Run the same test multiple times in a loop.
- If possible, introduce controlled random delays in certain places in debug builds to change timing.
For shared-memory code, some runtimes offer environment variables or debug options to change scheduling strategies, which can help surface races.
Timeout-based hang detection
Deadlocks often appear as hangs. Automate detection by:
- Wrapping tests in a timeout:
- If a parallel test exceeds some reasonable multiple of its usual runtime, mark it as failed and gather diagnostics.
- Using job scheduler time limits for larger-scale tests.
Ensure tests emit enough logging/heartbeat information that you can see where the hang occurred (e.g., “entering MPI_Allreduce on rank 17”).
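A hedged sketch of timeout-based hang detection using Python's subprocess timeout; the command line and log file name are placeholders.

```python
import subprocess

def run_with_timeout(cmd, timeout_s, log_path="test_output.log"):
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT,
                                  timeout=timeout_s)
            return proc.returncode == 0
        except subprocess.TimeoutExpired:
            # Treat the hang as a failure; the log's last heartbeat lines
            # (e.g., "entering MPI_Allreduce on rank 17") show where it stopped.
            print(f"TIMEOUT after {timeout_s}s: {' '.join(cmd)}")
            return False

if __name__ == "__main__":
    ok = run_with_timeout(["mpirun", "-n", "4", "./halo_exchange_test"],
                          timeout_s=120)
    assert ok
```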
Golden reference and cross-configuration comparisons
A powerful strategy for testing parallel codes is to compare outputs against trusted references.
Serial or low-parallel “golden” runs
Often, there is a simpler, trusted version of the algorithm:
- Serial implementation.
- Single-process MPI run.
- Low-thread OpenMP run.
You can:
- Generate canonical reference outputs (files, checksums, metrics) from these golden runs.
- In parallel tests, compare:
- Final results.
- Key aggregated quantities.
- Statistical properties for stochastic codes.
This helps separate “parallelization changed the answer” from “algorithm changed the answer.”
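As an illustration, a small comparison against a stored golden file might look like the following; the file layout (a JSON file with a total and an L2 norm) and its path are assumptions.

```python
import json
import numpy as np

def compare_to_golden(result, golden_path="golden/serial_run.json",
                      rtol=1e-9, atol=1e-12):
    with open(golden_path) as f:
        golden = json.load(f)
    # Aggregated quantities are often more robust than full field comparisons.
    assert abs(result.sum() - golden["total"]) <= atol + rtol * abs(golden["total"])
    assert np.isclose(np.linalg.norm(result), golden["l2_norm"],
                      rtol=rtol, atol=atol)
```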
Cross-configuration consistency checks
Even without an external reference, compare different parallel configurations:
- N ranks vs 2N ranks with the same physical parameters.
- Different domain decompositions (e.g., 1D vs 2D partitioning) that should be mathematically equivalent.
- Different thread counts.
Consistency constraints:
- For deterministic models: results should be identical within numerical tolerance.
- For stochastic models: distributions or summary statistics should agree within uncertainty bounds.
Testing MPI-based codes
Testing strategies must account for the multi-process nature of MPI applications.
Structuring MPI tests
Use mpirun/srun in your test harness to run tests with specific process counts:
- Provide test executables that:
- Initialize MPI.
- Run a suite of checks.
- Call MPI_Abort or return non-zero on failure.
Test logic should carefully manage:
- Rank-specific behavior:
- Only rank 0 writes summary messages, unless all ranks log to separate files.
- Collective assertions:
- If any rank detects an error, coordinate a clean abort or clearly indicate which rank failed.
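A hedged mpi4py sketch of such a test executable: each rank runs its local checks, the error status is combined with a collective reduction, and the program exits non-zero (or aborts) if any rank failed.

```python
import sys
from mpi4py import MPI

def run_checks(comm):
    """Rank-local checks; append a message to `errors` on each failure."""
    errors = []
    # ... rank-specific assertions for this test go here ...
    return errors

comm = MPI.COMM_WORLD
errors = run_checks(comm)
local_failed = 1 if errors else 0
any_failed = comm.allreduce(local_failed, op=MPI.MAX)  # collective assertion

if errors:
    print(f"[rank {comm.Get_rank()}] FAILED: {errors}", file=sys.stderr)
if comm.Get_rank() == 0:
    print("PASS" if any_failed == 0 else "FAIL")
sys.exit(any_failed)  # or comm.Abort(1) if a clean exit is not possible
```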
Partitioning tests by rank count
Design some tests that only make sense for specific communicator sizes:
- Tests that require at least 4 ranks (e.g., 2D halo exchange).
- Tests for odd/even numbers of ranks.
- Tests for power-of-two communicator sizes (common in collective algorithms).
Use test harness logic to skip tests when the communicator size is incompatible, with a clear message (not a silent pass).
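A small mpi4py-flavoured sketch of rank-count gating; how a "skip" is reported depends on your harness, so the print-and-return below is only illustrative.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

def test_2d_halo_exchange():
    if size < 4:
        if comm.Get_rank() == 0:
            print(f"SKIP: 2D halo exchange test needs >= 4 ranks, got {size}")
        return  # report as a skip in your harness, never as a silent pass
    # ... actual halo-exchange checks for >= 4 ranks go here ...
```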
Testing OpenMP and other shared-memory parallel codes
The focus here is on strategically probing thread-parallel behavior.
Parameter sweeps over threads
Run tests with:
- 1 thread: baseline, ensures no dependence on parallel features.
- 2 threads: minimal concurrency, often enough to expose many race conditions.
- Several threads up to the practical maximum for your test environment.
This can be automated by running the same test binary multiple times with different OMP_NUM_THREADS settings.
Thread-safe checks and assertions
For OpenMP tests:
- Be careful where you place assertions:
- Assertions inside parallel regions should avoid race conditions themselves.
- Consider using #pragma omp critical or reductions when aggregating test results.
- Collect error flags in thread-private variables, then reduce to a shared status at the end.
Emit summaries after the parallel region ends rather than printing from every thread while it runs (both for performance and to keep logs readable).
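The same pattern can be sketched with Python threads in place of literal OpenMP: each thread records failures in its own list (the analogue of a thread-private flag), and a single reduction and summary happen after all threads join. compute is a hypothetical stand-in for the kernel under test.

```python
import threading

N_THREADS = 8
errors_per_thread = [[] for _ in range(N_THREADS)]  # "thread-private" error lists

def compute(tid, i):
    """Hypothetical kernel under test (stand-in for the real computation)."""
    return (tid * 1000 + i) % 7 / 7.0

def worker(tid):
    for i in range(1000):
        value = compute(tid, i)
        if not (0.0 <= value <= 1.0):                # the invariant being checked
            errors_per_thread[tid].append((i, value))  # no shared-state races

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Single reduction and summary at the end, instead of printing from every thread.
all_errors = [e for errs in errors_per_thread for e in errs]
assert not all_errors, f"{len(all_errors)} invariant violations"
```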
Hybrid (MPI + threads/GPU) testing strategies
Hybrid codes mix MPI with shared-memory and/or accelerators.
Layered testing
Test each layer separately where possible:
- MPI layer:
- Domain decomposition, halo exchanges, collectives with simplified per-rank computation.
- Thread/GPU layer:
- Kernels or threaded loops on a single rank with realistic data sizes.
Then add tests that exercise the full hybrid stack:
- Few nodes, few ranks per node, few threads per rank.
- With and without accelerators enabled.
Configuration-based disabling
For automated tests on systems where GPUs or many cores are not available:
- Provide configuration options to disable or mock accelerators and large-scale parallelism.
- Ensure that tests still cover as much logic as possible in these reduced modes.
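With pytest, such configuration-based disabling can be a simple skip condition; the TEST_HAS_GPU environment variable below is a hypothetical convention for signalling accelerator availability.

```python
import os
import pytest

HAS_GPU = os.environ.get("TEST_HAS_GPU", "0") == "1"  # hypothetical convention

@pytest.mark.skipif(not HAS_GPU, reason="no GPU available in this environment")
def test_gpu_kernel_matches_cpu_reference():
    ...
```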
Property-based and invariant-based testing
Beyond fixed input–output pairs, you can test that your code always preserves certain properties.
Examples of useful properties
- Conservation properties:
- Total mass or energy stays constant (or changes with a known rate).
- Symmetries:
- Results invariant under certain transformations (e.g., rotation, permutation of processes).
- Monotonicity or bounds:
- Physical quantities stay within physically meaningful limits (non-negative densities).
Property-based testing frameworks (e.g., QuickCheck-style tools) can generate many random inputs in small problem sizes and check that these properties hold, increasing coverage of corner cases.
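A minimal sketch with the Hypothesis library (a QuickCheck-style tool for Python): many small random inputs are generated and a conservation property is checked. redistribute is a hypothetical stand-in for the real operation under test.

```python
import numpy as np
from hypothesis import given, strategies as st

def redistribute(field):
    """Hypothetical stand-in for the real redistribution step."""
    return np.roll(field, 1)

@given(st.lists(st.floats(min_value=0.0, max_value=1e3,
                          allow_nan=False, allow_infinity=False),
                min_size=1, max_size=256))
def test_redistribution_conserves_mass(values):
    field = np.array(values)
    redistributed = redistribute(field)
    # Conservation property: total "mass" is unchanged up to rounding.
    assert np.isclose(redistributed.sum(), field.sum(), rtol=1e-12, atol=1e-12)
```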
Parallel-specific invariants
Parallel distribution adds extra invariants:
- Data partitions cover the domain exactly with no overlaps or gaps.
- Sum of local counts equals global count from a collective.
- After communication steps (halo exchange, gather/scatter), consistency checks across ranks.
Build such checks into test-only code paths to catch distribution and communication errors early.
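For example, a test-only check that rank-local counts add up to the expected global count might look like this mpi4py sketch.

```python
from mpi4py import MPI

def check_partition_covers_domain(local_n, expected_global_n, comm=MPI.COMM_WORLD):
    """Test-only invariant: local element counts must sum to the global count."""
    total = comm.allreduce(local_n, op=MPI.SUM)
    assert total == expected_global_n, (
        f"partition mismatch: local counts sum to {total}, "
        f"expected {expected_global_n}"
    )
```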
Testing performance-related behavior
While correctness is primary, performance regressions in parallel codes can also be considered test failures.
Performance regression tests
For selected kernels or workflows:
- Record baseline timings or flop rates for given test problems and configurations.
- In CI or scheduled tests:
- Re-measure performance.
- Flag regressions beyond a tolerance (e.g., more than 10–20% slowdown).
These tests should be:
- Stable (avoid tiny problems where noise dominates).
- Resource-aware (do not require many nodes in a CI environment).
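A hedged sketch of such a check: time a kernel a few times, take the best run to reduce noise, and compare it against a stored baseline. The baseline file format and the 20% threshold are illustrative choices.

```python
import json
import time

def check_performance(run_kernel, baseline_path="baselines/kernel_time.json",
                      max_slowdown=1.20, repeats=5):
    times = []
    for _ in range(repeats):           # repeat to reduce noise
        start = time.perf_counter()
        run_kernel()
        times.append(time.perf_counter() - start)
    best = min(times)                  # best-of-N is more stable than a single run

    with open(baseline_path) as f:
        baseline = json.load(f)["seconds"]
    assert best <= baseline * max_slowdown, (
        f"performance regression: {best:.3f}s vs baseline {baseline:.3f}s"
    )
```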
Scalability sanity checks
You can include simple scaling sanity tests:
- Verify that runtime with 2 ranks is not massively worse than with 1 rank for the same problem.
- Verify that time with 4 threads is not significantly slower than with 1 thread for compute-bound kernels.
These are not full scalability studies, but they can catch gross misconfigurations or algorithmic regressions.
Continuous integration and automation for parallel tests
Integrating parallel tests into automated workflows requires some adaptation.
CI integration patterns
- Use containerized environments or cluster-based CI runners that support:
- mpirun/srun.
- Setting environment variables for OpenMP, CUDA, etc.
- Distinguish:
- Fast tests (tiny problems, few ranks/threads) run on every commit.
- Extended tests (larger problems, more ranks/threads) run nightly/weekly or before releases.
CI configuration can define multiple jobs, each with specific parallel settings.
Managing flaky tests
Non-deterministic parallel bugs sometimes show as intermittent failures:
- Mark tests as flaky only as a temporary measure:
- Retry once before failing.
- Log detailed context on failure:
- Configuration (ranks, threads, hardware, compiler).
- Random seeds used.
- Any relevant environment variables.
Use this information to reproduce and fix underlying issues rather than permanently accepting flakiness.
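A small sketch of a temporary retry-once wrapper that also records reproduction context on each failure; the logged fields mirror the list above, and the command and seed handling are illustrative.

```python
import os
import platform
import subprocess

def run_flaky(cmd, seed, retries=1):
    """Retry a test command once, logging context on every failure."""
    for attempt in range(retries + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        print(f"FAILURE (attempt {attempt + 1}): {' '.join(cmd)}")
        print(f"  host={platform.node()} seed={seed}")
        print(f"  OMP_NUM_THREADS={os.environ.get('OMP_NUM_THREADS')}")
        print(result.stdout[-2000:])   # tail of the output for diagnosis
    return False
```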
Practical guidelines and best practices
To summarize concrete strategies for testing parallel HPC codes:
- Design for testability:
- Separate pure computation from parallel orchestration.
- Provide deterministic or debug modes when possible.
- Test at multiple scales and configurations:
- Single-process/thread baseline.
- Small to medium parallel configurations in automated runs.
- Use reference runs and invariants:
- Compare against serial/golden outputs or cross-configuration results.
- Check physical and parallel invariants (conservation, partitioning).
- Proactively search for concurrency issues:
- Vary thread/rank counts.
- Run stress tests and use timeouts for hangs.
- Automate and document:
- Integrate parallel tests into CI.
- Document which tests require which resources and how to interpret their results.
These strategies make parallel HPC codes more reliable and maintainable, even as they evolve and target new architectures and scales.