Motivation for Restart Mechanisms
Restart mechanisms are the methods and conventions that allow an application to continue from a previous state instead of starting from scratch. While checkpointing focuses on writing state at specific times, restart mechanisms are about:
- How the application is structured to be restartable
- How it finds and interprets checkpoint data
- How it restores internal state and resumes work consistently
Effective restart mechanisms are critical when:
- Jobs run close to or beyond wall-time limits
- Hardware failures are non-negligible
- Long simulations must be extended, branched, or analyzed at multiple stages
- Results must be reproducible and auditable over time
Design Principles for Restartable Applications
Idempotent and deterministic behavior
Restarted runs should either:
- Produce identical results to an uninterrupted run (bitwise or within defined numerical tolerance), or
- Deviate only in well-understood, documented ways (e.g., due to different random seeds or non-deterministic reductions)
This implies:
- Avoiding hidden global state that is not captured in checkpoints (e.g., static variables that affect control flow)
- Carefully managing random number generators:
- Save RNG seeds or streams in the checkpoint
- Or derive seeds deterministically from a global seed and iteration indices
- Ensuring that the sequence of operations after restart is consistent with the original execution path
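As a concrete illustration of the RNG point above, here is a minimal sketch of saving and restoring generator state, assuming NumPy's `Generator`/`PCG64` API (other RNG libraries expose similar state access):

```python
import numpy as np

def rng_state_for_checkpoint(rng: np.random.Generator) -> dict:
    # The bit-generator state is a plain dict that can be stored in the
    # checkpoint alongside the solution fields.
    return rng.bit_generator.state

def restore_rng(saved_state: dict) -> np.random.Generator:
    # Recreate a generator and overwrite its state so the random stream
    # continues exactly where the interrupted run left off. This assumes
    # both runs use the same bit generator (PCG64 from default_rng).
    rng = np.random.default_rng()
    rng.bit_generator.state = saved_state
    return rng
```

Alternatively, deriving seeds deterministically from a global seed and the step index (e.g., `np.random.default_rng([global_seed, step])`) avoids storing RNG state in the checkpoint at all.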
Clear definition of “restart point”
A restart point is a logically consistent state in which all application invariants hold. The restart mechanism should define:
- At what logical time restarts can occur (e.g., after full time steps, after completing an iteration, after finishing a phase)
- What work is considered done versus pending
- How progress is tracked (e.g., current timestep index, global iteration count, simulation time $t$)
Restart points should align with natural synchronization points to avoid partial or inconsistent states.
Separation of concerns
Structure the application so that:
- Core computation logic (e.g., time stepping, iterative solver) is separate from:
- Checkpoint writing
- Restart initialization
- Command-line or configuration parsing
This makes it easier to extend, test, and maintain restart behavior.
Basic Restart Workflow
A typical restart-capable application follows this high-level logic:
- Startup phase
- Parse command-line options and configuration files.
- Decide whether this is:
- A fresh run (no prior state), or
- A restart (resume from a previous checkpoint).
- Initialization
- For a fresh run:
- Initialize domain, parameters, and data structures from input files or analytic setups.
- For a restart:
- Discover which checkpoint to use.
- Read checkpoint files.
- Rebuild all necessary internal state.
- Main computation loop
- At each logical step (e.g., timestep, iteration):
- Perform computation.
- Optionally write checkpoints based on policies defined elsewhere.
- Shutdown / finalization
- Write final outputs and logs.
- Record metadata that helps future restarts or reproducibility.
The restart mechanism is primarily concerned with the startup and initialization phases, and with how the main loop interacts with checkpoints.
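The sketch below ties these phases together in a deliberately tiny, self-contained form: the state is a toy dictionary and checkpoints are pickle files, whereas a real application would use its own data structures and formats (e.g., HDF5). Names such as `advance_one_step` and the checkpoint flags are illustrative.

```python
import argparse
import os
import pickle

def write_checkpoint(state, step, ckpt_dir):
    path = os.path.join(ckpt_dir, f"chkpt_step_{step:06d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def read_checkpoint(path):
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["state"], data["step"]

def advance_one_step(state):
    # Stand-in for the real solver/physics update.
    state["t"] += state["dt"]

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--restart-from", default=None,
                        help="checkpoint file to resume from (default: fresh run)")
    parser.add_argument("--final-step", type=int, default=100)
    parser.add_argument("--ckpt-dir", default="checkpoints")
    parser.add_argument("--ckpt-interval", type=int, default=10)
    args = parser.parse_args()
    os.makedirs(args.ckpt_dir, exist_ok=True)

    if args.restart_from is not None:
        # Restart: rebuild in-memory state from the chosen checkpoint.
        state, last_step = read_checkpoint(args.restart_from)
    else:
        # Fresh run: initialize from inputs or an analytic setup.
        state, last_step = {"t": 0.0, "dt": 0.01}, 0

    for step in range(last_step + 1, args.final_step + 1):
        advance_one_step(state)
        if step % args.ckpt_interval == 0:
            write_checkpoint(state, step, args.ckpt_dir)

if __name__ == "__main__":
    main()
```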
Discovering and Selecting Checkpoints
Policies for choosing a checkpoint
When multiple checkpoints exist, a restart mechanism must decide which to use. Common approaches:
- Latest valid checkpoint (default)
- Find the checkpoint with the highest completed step/timestep index.
- Verify file integrity (size, checksum, or a small header with a magic number/format version).
- Optionally support falling back to an older checkpoint if the latest is corrupt or incomplete (a sketch of this policy follows the list).
- User-specified checkpoint
- Allow the user to select a specific checkpoint via:
- Command-line option (e.g., `--restart-from step_1000`)
- Configuration file
- Environment variable
- Useful for:
- Branching runs (e.g., exploring different parameters from a shared starting state)
- Debugging problematic steps
- Best-effort recovery
- Enumerate available checkpoints in a directory.
- Prefer the most recent one that passes basic consistency checks.
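A sketch of the "latest valid checkpoint" policy with fallback; the `chkpt_step_NNNNNN.h5` naming matches the convention shown in the next subsection, and the integrity check is deliberately left to a caller-supplied function:

```python
import glob
import os
import re

def find_restart_checkpoint(ckpt_dir, is_valid):
    """Return the newest checkpoint that passes validation, or None.

    `is_valid` is a caller-supplied integrity check (size, magic number,
    checksum, ...); anything failing it is skipped in favor of an older file.
    """
    step_pattern = re.compile(r"chkpt_step_(\d+)\.h5$")
    candidates = []
    for path in glob.glob(os.path.join(ckpt_dir, "chkpt_step_*.h5")):
        match = step_pattern.search(os.path.basename(path))
        if match:
            candidates.append((int(match.group(1)), path))
    # Highest completed step first; fall back if the newest one is corrupt.
    for _, path in sorted(candidates, reverse=True):
        if is_valid(path):
            return path
    return None

def nonempty(path):
    # Cheapest possible sanity check; real codes should verify headers/checksums.
    return os.path.getsize(path) > 0
```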
Naming and organization conventions
Good checkpoint naming and structure simplify restarts:
- Include key identifiers in filenames:
- Step or timestep index: `chkpt_step_000100.h5`
- Possibly simulation ID or parameter hash for easier bookkeeping
- Use per-job or per-run subdirectories to avoid mixing checkpoints: `run_001/checkpoints/`
- Include metadata in each checkpoint or in a sidecar file, e.g.:
- Build version, git commit, or hash
- Input configuration summary
- Date/time, machine, number of MPI ranks, processor grid
- Notes on format version (to detect incompatibilities)
This metadata is essential for robust restart logic and for detecting incompatible or outdated checkpoint formats.
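A minimal sketch of writing such a sidecar file as JSON next to each checkpoint; the field names and the `format_version` value are illustrative, not a standard:

```python
import json
import platform
import time

def write_sidecar_metadata(ckpt_path, step, n_ranks, code_version, config_summary):
    """Write a small JSON sidecar next to a checkpoint; fields are illustrative."""
    meta = {
        "format_version": 3,              # bump whenever the checkpoint layout changes
        "checkpoint": ckpt_path,
        "step": step,
        "written_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "hostname": platform.node(),
        "mpi_ranks": n_ranks,
        "code_version": code_version,     # e.g., git commit hash or release tag
        "config": config_summary,         # short summary of key input parameters
    }
    with open(ckpt_path + ".meta.json", "w") as f:
        json.dump(meta, f, indent=2)
```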
Restoring State from Checkpoints
Minimal vs full-state restarts
There is a design choice about how much state must be captured and restored:
- Full-state restart
- Save all internal variables necessary to resume exactly:
- Primary solution fields
- Auxiliary fields
- Time variables
- Solver histories (e.g., previous iterates for multi-step schemes)
- RNG states
- Restart code simply reconstructs the in-memory state and continues the loop.
- Minimal-state restart
- Save only what cannot be cheaply recomputed:
- Solution fields
- Coarse meta-state like current time, iteration index
- On restart:
- Re-run some initialization or setup stages
- Reconstruct derived data from primary state
- Reduces I/O volume but may increase restart time and complexity.
The restart mechanism must ensure consistency regardless of which approach is used.
Reconstructing control flow
To resume correctly, the program needs enough information to know:
- What logical step was last completed (e.g., `current_step`)
- What phase or stage of the algorithm was in progress:
- E.g., within a time step: predictor/corrector stage, sub-iterations
- Outer vs inner iterations in multi-physics codes
Typical strategy:
- Only write checkpoints at well-defined outer-loop boundaries, e.g.:
```python
for step in range(last_step + 1, final_step + 1):
    advance_one_step()
    if should_checkpoint(step):
        write_checkpoint(step)
```
- In the checkpoint, store:
- The completed step number (`last_step`)
- The logical simulation time `t`
- Any subcycling or stage counters if necessary
On restart:
- Read `last_step` and `t`
- Initialize the loop to `step = last_step + 1`
- Resume the same control logic as in a fresh run
Handling partially written checkpoints
Restart mechanisms must handle the possibility that a job fails while writing a checkpoint, leaving:
- Truncated or corrupt files
- Inconsistent sets of files across ranks
Standard techniques:
- Atomicity via temporary files:
- Write to a temporary name (e.g., `chkpt_step_0100.tmp`).
- Only after successful completion, rename to `chkpt_step_0100`.
- Restart logic treats only non-`.tmp` files as candidates.
- Versioned or double-buffered checkpoints:
- Keep at least two generations: `chkpt_A` and `chkpt_B`.
- Always write a new generation fully before updating a small “pointer” file (e.g., `latest.chkpt`) that tells restarts which one is valid.
- Restart uses the checkpoint indicated by the pointer and can fall back to the previous one if needed.
- Internal consistency checks:
- Include magic numbers, format version, and checksums in checkpoint headers.
- Skip any candidate checkpoint that fails validation.
The restart mechanism should be conservative: it is better to lose a small amount of progress than to resume from corrupted state.
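A sketch of the temporary-file-plus-rename pattern for a single-file checkpoint (pickle here for brevity); `os.replace` is atomic on a local POSIX file system, but this behavior should be verified on the parallel file system of the target machine:

```python
import os
import pickle

def write_checkpoint_atomically(state, step, ckpt_dir):
    """Write to a temporary name, then rename: readers never see partial files."""
    final_path = os.path.join(ckpt_dir, f"chkpt_step_{step:06d}.pkl")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())     # push the data to storage before renaming
    # On the same file system, os.replace swaps the name atomically, so the
    # final filename only ever refers to a completely written checkpoint.
    os.replace(tmp_path, final_path)
```

Restart logic then only needs to skip `.tmp` files when enumerating candidates.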
Parallel and Distributed Restart
In parallel applications, restart mechanisms must be consistent across all ranks or threads.
MPI-based codes
Challenges and patterns:
- File layout choices:
- One global file vs one file per rank vs grouped files
- Restart mechanism must know how many ranks produced each checkpoint and how data is partitioned.
- Consistent metadata across ranks:
- Store the global number of processes and domain decomposition information in the checkpoint.
- On restart, validate:
- Whether the same number of ranks is used
- Whether a different decomposition is allowed and supported (e.g., via remapping tools)
- Collective coordination on restart:
- All ranks must:
- Agree on which checkpoint to load
- Participate in I/O calls in the same order and with compatible parameters
- A common pattern:
- Rank 0 reads the “latest checkpoint” metadata and broadcasts the chosen step and file names.
- Each rank then reads its own portion of the data (see the sketch after this list).
- Changing parallel layout between runs:
- More advanced restart mechanisms support changing:
- Number of MPI ranks
- Process grid topology
- This requires:
- Storing data in a format that is independent of the original decomposition (e.g., global indexing)
- Logic to re-distribute data over a new process layout at restart.
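A sketch of the rank-0-decides-and-broadcasts pattern, assuming mpi4py; the same structure applies with `MPI_Bcast` in C or Fortran, and the discovery helper is passed in rather than assumed:

```python
from mpi4py import MPI

def agree_on_checkpoint(ckpt_dir, find_checkpoint):
    """Rank 0 picks the restart checkpoint; every rank receives the same answer.

    `find_checkpoint` is a discovery helper like the one sketched earlier;
    only rank 0 touches the checkpoint metadata, avoiding redundant I/O.
    """
    comm = MPI.COMM_WORLD
    chosen = None
    if comm.Get_rank() == 0:
        chosen = find_checkpoint(ckpt_dir)   # path of the chosen checkpoint, or None
    # Broadcast the decision so all ranks open the same files; each rank
    # then reads only its own portion of the data.
    chosen = comm.bcast(chosen, root=0)
    return chosen
```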
Shared-memory and hybrid codes
For OpenMP or hybrid MPI+OpenMP codes:
- Avoid storing thread-local implementation details:
- Checkpoint global state and data structures; allow threads to reconstruct their per-thread state at runtime.
- Ensure that the restart process re-initializes:
- Thread pools
- Thread affinities or binding (if used)
- Any local caches or buffers used by threads
In all cases, the restart mechanism should treat parallel execution as an implementation detail; the checkpoint should represent a consistent global state.
Types of Restart Mechanisms
Cold vs warm vs hot restarts
- Cold restart
- Application starts from initial conditions only.
- No checkpoint is used.
- Typically used for brand new simulations.
- Warm restart
- Application restarts using a checkpoint taken at a natural boundary (e.g., end of timestep).
- Most common in batch job continuation and failure recovery.
- Requires complete state for accurate continuation.
- Hot restart
- Restart from a checkpoint taken during an inner iteration or sub-step.
- More complex; often requires saving more transient state.
- Used in specialized cases where granularity must be fine, e.g., very costly timesteps.
Most HPC applications implement warm restarts; hot restarts are justified only when restart intervals must be extremely short.
Manual vs automatic restart
- Manual restart
- User submits a new batch job specifying:
- Restart flag
- Path to checkpoints
- Updated wall-time or resource request
- Simple to implement and understand.
- Relies on user or workflow system to monitor failures and resubmit.
- Automatic restart within a workflow
- External tools (e.g., workflow managers, scripts) monitor job status.
- On failure or wall-time expiration:
- A new job is automatically submitted with restart parameters.
- Application itself may not know that a restart occurred; it just follows its restart logic.
- Self-managed continuation
- Application monitors its remaining wall time (e.g., via environment variables or scheduler APIs).
- Before the wall-time limit is reached, it:
- Writes a final checkpoint
- Exits cleanly with a code indicating “needs continuation”.
- A wrapper script or workflow reacts and resubmits the job.
The restart mechanism inside the application is the same, but integration with external tools determines how automatic the process is.
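A sketch of self-managed continuation in which the wall-time budget is passed in explicitly (e.g., from the job script) rather than queried from the scheduler; the exit code and the helper callables are illustrative:

```python
import sys
import time

NEEDS_CONTINUATION = 85   # arbitrary exit code a wrapper script can watch for

def run_with_time_budget(state, last_step, final_step, budget_s, margin_s,
                         advance_one_step, write_checkpoint):
    """Advance the simulation, but checkpoint and stop before the budget runs out.

    `budget_s` is the wall time granted to the job; `margin_s` leaves room to
    write the final checkpoint and shut down cleanly.
    """
    start = time.monotonic()
    for step in range(last_step + 1, final_step + 1):
        advance_one_step(state)
        if time.monotonic() - start > budget_s - margin_s:
            write_checkpoint(state, step)
            # Signal "not finished, please resubmit" to the wrapper/workflow.
            sys.exit(NEEDS_CONTINUATION)
    return state
```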
Robustness and Error Handling in Restarts
Handling incompatible or outdated checkpoints
Over time, application formats or data structures change. Restart logic should:
- Store a format version number and code version in each checkpoint
- On restart:
- Compare file version with the expected version
- If incompatible:
- Fail gracefully with a clear error message
- Optionally suggest a conversion tool if available
- Avoid silently misinterpreting old checkpoints
For long-lived codes, a small set of migration tools or compatibility layers may be necessary.
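A sketch of such a version check, assuming the format version is stored in the checkpoint header or sidecar metadata as in the earlier example:

```python
EXPECTED_FORMAT_VERSION = 3   # version written by the current code (illustrative)

def check_checkpoint_version(metadata):
    """Fail loudly when the checkpoint format does not match what the code expects.

    `metadata` is the dict read from the checkpoint header or sidecar file.
    """
    found = metadata.get("format_version")
    if found != EXPECTED_FORMAT_VERSION:
        raise RuntimeError(
            f"Checkpoint format version {found} is not supported by this build "
            f"(expected {EXPECTED_FORMAT_VERSION}). "
            "Convert the checkpoint with a migration tool or use a matching code version."
        )
```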
Dealing with partial outputs and side effects
Not all side effects are captured in checkpoints:
- Output files (e.g., diagnostic logs, derived fields)
- Post-processing status
- External databases or services
Restart mechanisms should:
- Ensure that core simulation state is consistent, even if some derived outputs are missing or duplicated.
- Record in the checkpoint what has been written so far (e.g., last output time for diagnostics).
- On restart:
- Optionally:
- Recompute missing outputs
- Or skip already-produced outputs based on recorded metadata
Application design should tolerate duplication of some outputs (e.g., repeated log messages) rather than risking loss of scientific state.
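A sketch of skipping already-produced diagnostics based on a `last_output_step` recorded in the checkpoint; the output routine is passed in as a callable and the bookkeeping scheme is illustrative:

```python
def maybe_write_diagnostics(step, output_interval, last_output_step, writer):
    """Call `writer(step)` only for outputs not already produced before a restart.

    `last_output_step` is read from the checkpoint; outputs at or before that
    step are assumed to exist already and are skipped rather than duplicated.
    """
    if step % output_interval == 0 and step > last_output_step:
        writer(step)
        return step          # record the new last completed output
    return last_output_step
```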
Integrating Restart with HPC Workflows
Restart-aware job scripts
Restart mechanisms interact with job scripts written for schedulers. Typical patterns:
- A single job script that:
- Checks for existing checkpoints.
- Chooses fresh run vs restart mode.
- Runs the executable with appropriate arguments.
- Looping or chained jobs:
- A job runs until close to wall-time, writes a checkpoint, and exits.
- The next job in the chain:
- Is submitted automatically
- Starts in restart mode from the last checkpoint.
The application’s restart mechanism must be predictable and well-documented so job scripts can reliably control behavior via flags or environment variables.
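A sketch of a small Python driver that a job script could invoke; the executable name, flags, checkpoint layout, and the `sbatch job.sh` resubmission are all placeholders for the real setup:

```python
import glob
import os
import subprocess
import sys

EXECUTABLE = "./simulation"           # hypothetical application binary
CKPT_DIR = "run_001/checkpoints"
NEEDS_CONTINUATION = 85               # exit code agreed with the application

def launch_once():
    # Decide fresh run vs restart by looking for existing checkpoints
    # (zero-padded step numbers make the lexicographic sort chronological).
    candidates = sorted(glob.glob(os.path.join(CKPT_DIR, "chkpt_step_*.h5")))
    cmd = [EXECUTABLE, "--final-step", "100000"]
    if candidates:
        cmd += ["--restart-from", candidates[-1]]
    returncode = subprocess.run(cmd).returncode
    if returncode == NEEDS_CONTINUATION:
        # Chain the next job; under Slurm this could be `sbatch job.sh`.
        subprocess.run(["sbatch", "job.sh"])
    return returncode

if __name__ == "__main__":
    sys.exit(launch_once())
```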
Provenance and auditability
For long projects, it is important to know:
- Which checkpoints were used to start which runs
- What parameters and code versions were in effect at each restart
Restart mechanisms can support this by:
- Logging restart events clearly to a text log:
- Checkpoint file name
- Step/time
- Code version
- Parameters
- Optionally writing a small “run history” or provenance file that links:
- Initial conditions
- All subsequent checkpoints and restarts
- Final outputs
This is essential for scientific reproducibility and debugging.
Testing and Validation of Restart Mechanisms
A restart mechanism is only trustworthy if it is tested systematically.
Basic equivalence tests
Common strategies:
- Short vs long run comparison
- Run A: Simulate from step 0 to $N$ without interruption.
- Run B: Simulate from step 0 to $K$, write a checkpoint, restart, and continue to $N$.
- Compare key metrics (fields, integrated quantities, diagnostics) at step $N$.
- Define acceptable tolerances, especially for floating-point operations.
- Multiple restart points
- Repeat the above with checkpoints at different steps $K$ (early, middle, late).
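A sketch of such an equivalence test, assuming the application exposes a callable that can run a given number of steps from either a fresh start or a checkpoint; the tolerance is illustrative and must be tuned to the code's numerics:

```python
import numpy as np

def check_restart_equivalence(run, N, K, tol=1e-12):
    """Compare an uninterrupted run with a checkpointed-and-restarted run.

    `run(start, n_steps)` is assumed to return (final_fields, checkpoint),
    where `checkpoint` can be passed back in as `start`; `None` means a
    fresh start. Tolerances must reflect the code's floating-point behavior.
    """
    reference, _ = run(None, N)        # run A: steps 0 -> N without interruption
    _, ckpt = run(None, K)             # run B, part 1: steps 0 -> K, keep the checkpoint
    restarted, _ = run(ckpt, N - K)    # run B, part 2: resume and continue K -> N
    assert np.allclose(reference, restarted, rtol=tol, atol=tol), \
        "restarted run diverged from the uninterrupted reference"
```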
Failure simulation
Simulate failures to ensure restart robustness:
- Intentionally kill the job at random times around checkpointing.
- Corrupt or truncate a checkpoint file (on a test system) to see if:
- The application detects it
- Falls back gracefully if designed to
- Fails with a clear and helpful message
Regression testing
Integrate restart tests into automated testing:
- Unit tests for:
- Metadata reading/writing
- Version compatibility checks
- Integration tests for:
- Full restart workflow with MPI (where relevant)
- Various configuration options controlling restart behavior
Keeping restart mechanisms under continuous test helps prevent subtle breakage as the code evolves.
Best Practices Summary
Effective restart mechanisms in HPC applications should:
- Clearly distinguish fresh runs from restarts using robust command-line or configuration options.
- Define consistent and well-chosen restart points aligned with algorithmic boundaries.
- Store enough state to resume correctly, but avoid unnecessary data when possible.
- Use naming conventions, metadata, and versioning to enable safe checkpoint discovery and compatibility checks.
- Handle partial or corrupt checkpoints conservatively, with mechanisms such as temporary files and multiple generations.
- Work correctly and consistently in parallel environments, including MPI and hybrid setups.
- Integrate smoothly with scheduler job scripts and higher-level workflows.
- Be thoroughly tested through equivalence runs, failure simulations, and regression tests.
A well-implemented restart mechanism transforms checkpoint data into a powerful tool for reliability, scalability, and reproducibility across complex HPC workflows.