Motivation for Restart Mechanisms
Restart mechanisms are the methods and conventions that allow an application to continue from a previous state instead of starting from scratch. While checkpointing focuses on writing state at specific times, restart mechanisms are about:
- How the application is structured to be restartable
- How it finds and interprets checkpoint data
- How it restores internal state and resumes work consistently
Effective restart mechanisms are critical when:
- Jobs run close to or beyond wall-time limits
- Hardware failures are non-negligible
- Long simulations must be extended, branched, or analyzed at multiple stages
- Results must be reproducible and auditable over time
Design Principles for Restartable Applications
Idempotent and deterministic behavior
Restarted runs should either:
- Produce identical results to an uninterrupted run (bitwise or within defined numerical tolerance), or
- Deviate only in well-understood, documented ways (e.g., due to different random seeds or non-deterministic reductions)
This implies:
- Avoiding hidden global state that is not captured in checkpoints (e.g., static variables that affect control flow)
- Carefully managing random number generators:
- Save RNG seeds or streams in the checkpoint
- Or derive seeds deterministically from a global seed and iteration indices
- Ensuring that the sequence of operations after restart is consistent with the original execution path
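As a concrete illustration of the RNG point above, here is a minimal sketch of saving and restoring generator state, assuming NumPy's `Generator`/`PCG64` API (other RNG libraries expose similar state access):

```python
import numpy as np

def rng_state_for_checkpoint(rng: np.random.Generator) -> dict:
    # The bit-generator state is a plain dict that can be stored in the
    # checkpoint alongside the solution fields.
    return rng.bit_generator.state

def restore_rng(saved_state: dict) -> np.random.Generator:
    # Recreate a generator and overwrite its state so the random stream
    # continues exactly where the interrupted run left off. This assumes
    # both runs use the same bit generator (PCG64 from default_rng).
    rng = np.random.default_rng()
    rng.bit_generator.state = saved_state
    return rng
```

Alternatively, deriving seeds deterministically from a global seed and the step index (e.g., `np.random.default_rng([global_seed, step])`) avoids storing RNG state in the checkpoint at all.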
Clear definition of “restart point”
A restart point is a logically consistent state in which all application invariants hold. The restart mechanism should define:
- At what logical time restarts can occur (e.g., after full time steps, after completing an iteration, after finishing a phase)
- What work is considered done versus pending
- How progress is tracked (e.g., current timestep index, global iteration count, simulation time $t$)
Restart points should align with natural synchronization points to avoid partial or inconsistent states.
Separation of concerns
Structure the application so that:
- Core computation logic (e.g., time stepping, iterative solver) is separate from:
- Checkpoint writing
- Restart initialization
- Command-line or configuration parsing
This makes it easier to extend, test, and maintain restart behavior.
Basic Restart Workflow
A typical restart-capable application follows this high-level logic:
- Startup phase
- Parse command-line options and configuration files.
- Decide whether this is:
- A fresh run (no prior state), or
- A restart (resume from a previous checkpoint).
- Initialization
- For a fresh run:
- Initialize domain, parameters, and data structures from input files or analytic setups.
- For a restart:
- Discover which checkpoint to use.
- Read checkpoint files.
- Rebuild all necessary internal state.
- Main computation loop
- At each logical step (e.g., timestep, iteration):
- Perform computation.
- Optionally write checkpoints based on policies defined elsewhere.
- Shutdown / finalization
- Write final outputs and logs.
- Record metadata that helps future restarts or reproducibility.
The restart mechanism is primarily concerned with the startup and initialization phases, and with how the main loop interacts with checkpoints.
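The sketch below ties these phases together in a deliberately tiny, self-contained form: the state is a toy dictionary and checkpoints are pickle files, whereas a real application would use its own data structures and formats (e.g., HDF5). Names such as `advance_one_step` and the checkpoint flags are illustrative.

```python
import argparse
import os
import pickle

def write_checkpoint(state, step, ckpt_dir):
    path = os.path.join(ckpt_dir, f"chkpt_step_{step:06d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def read_checkpoint(path):
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["state"], data["step"]

def advance_one_step(state):
    # Stand-in for the real solver/physics update.
    state["t"] += state["dt"]

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--restart-from", default=None,
                        help="checkpoint file to resume from (default: fresh run)")
    parser.add_argument("--final-step", type=int, default=100)
    parser.add_argument("--ckpt-dir", default="checkpoints")
    parser.add_argument("--ckpt-interval", type=int, default=10)
    args = parser.parse_args()
    os.makedirs(args.ckpt_dir, exist_ok=True)

    if args.restart_from is not None:
        # Restart: rebuild in-memory state from the chosen checkpoint.
        state, last_step = read_checkpoint(args.restart_from)
    else:
        # Fresh run: initialize from inputs or an analytic setup.
        state, last_step = {"t": 0.0, "dt": 0.01}, 0

    for step in range(last_step + 1, args.final_step + 1):
        advance_one_step(state)
        if step % args.ckpt_interval == 0:
            write_checkpoint(state, step, args.ckpt_dir)

if __name__ == "__main__":
    main()
```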
Discovering and Selecting Checkpoints
Policies for choosing a checkpoint
When multiple checkpoints exist, a restart mechanism must decide which to use. Common approaches:
- Latest valid checkpoint (default)
- Find the checkpoint with the highest completed step/timestep index.
- Verify file integrity (size, checksum, or a small header with a magic number/format version).
- Optionally support falling back to an older checkpoint if the latest is corrupt or incomplete (a sketch of this policy follows the list).
- User-specified checkpoint
- Allow the user to select a specific checkpoint via:
- Command-line option (e.g., `--restart-from step_1000`)
- Configuration file
- Environment variable
- Useful for:
- Branching runs (e.g., exploring different parameters from a shared starting state)
- Debugging problematic steps
- Best-effort recovery
- Enumerate available checkpoints in a directory.
- Prefer the most recent one that passes basic consistency checks.
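A sketch of the "latest valid checkpoint" policy with fallback; the `chkpt_step_NNNNNN.h5` naming matches the convention shown in the next subsection, and the integrity check is deliberately left to a caller-supplied function:

```python
import glob
import os
import re

def find_restart_checkpoint(ckpt_dir, is_valid):
    """Return the newest checkpoint that passes validation, or None.

    `is_valid` is a caller-supplied integrity check (size, magic number,
    checksum, ...); anything failing it is skipped in favor of an older file.
    """
    step_pattern = re.compile(r"chkpt_step_(\d+)\.h5$")
    candidates = []
    for path in glob.glob(os.path.join(ckpt_dir, "chkpt_step_*.h5")):
        match = step_pattern.search(os.path.basename(path))
        if match:
            candidates.append((int(match.group(1)), path))
    # Highest completed step first; fall back if the newest one is corrupt.
    for _, path in sorted(candidates, reverse=True):
        if is_valid(path):
            return path
    return None

def nonempty(path):
    # Cheapest possible sanity check; real codes should verify headers/checksums.
    return os.path.getsize(path) > 0
```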
Naming and organization conventions
Good checkpoint naming and structure simplify restarts:
- Include key identifiers in filenames:
- Step or timestep index: `chkpt_step_000100.h5`
- Possibly simulation ID or parameter hash for easier bookkeeping
- Use per-job or per-run subdirectories to avoid mixing checkpoints: `run_001/checkpoints/`
- Include metadata in each checkpoint or in a sidecar file, e.g.:
- Build version, git commit, or hash
- Input configuration summary
- Date/time, machine, number of MPI ranks, processor grid
- Notes on format version (to detect incompatibilities)
This metadata is essential for robust restart logic and for detecting incompatible or outdated checkpoint formats.
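A minimal sketch of writing such a sidecar file as JSON next to each checkpoint; the field names and the `format_version` value are illustrative, not a standard:

```python
import json
import platform
import time

def write_sidecar_metadata(ckpt_path, step, n_ranks, code_version, config_summary):
    """Write a small JSON sidecar next to a checkpoint; fields are illustrative."""
    meta = {
        "format_version": 3,              # bump whenever the checkpoint layout changes
        "checkpoint": ckpt_path,
        "step": step,
        "written_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "hostname": platform.node(),
        "mpi_ranks": n_ranks,
        "code_version": code_version,     # e.g., git commit hash or release tag
        "config": config_summary,         # short summary of key input parameters
    }
    with open(ckpt_path + ".meta.json", "w") as f:
        json.dump(meta, f, indent=2)
```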
Restoring State from Checkpoints
Minimal vs full-state restarts
There is a design choice about how much state must be captured and restored:
- Full-state restart
- Save all internal variables necessary to resume exactly:
- Primary solution fields
- Auxiliary fields
- Time variables
- Solver histories (e.g., previous iterates for multi-step schemes)
- RNG states
- Restart code simply reconstructs the in-memory state and continues the loop.
- Minimal-state restart
- Save only what cannot be cheaply recomputed:
- Solution fields
- Coarse meta-state like current time, iteration index
- On restart:
- Re-run some initialization or setup stages
- Reconstruct derived data from primary state
- Reduces I/O volume but may increase restart time and complexity.
The restart mechanism must ensure consistency regardless of which approach is used.
Reconstructing control flow
To resume correctly, the program needs enough information to know:
- What logical step was last completed (e.g., `current_step`)
- What phase or stage of the algorithm was in progress:
- E.g., within a time step: predictor/corrector stage, sub-iterations
- Outer vs inner iterations in multi-physics codes
Typical strategy:
- Only write checkpoints at well-defined outer-loop boundaries, e.g.:
```python
for step in range(last_step + 1, final_step + 1):
    advance_one_step()
    if should_checkpoint(step):
        write_checkpoint(step)
```
- In the checkpoint, store:
- The completed step number (`last_step`)
- The logical simulation time `t`
- Any subcycling or stage counters if necessary
On restart:
- Read `last_step` and `t`
- Initialize the loop to `step = last_step + 1`
- Resume the same control logic as in a fresh run
Handling partially written checkpoints
Restart mechanisms must handle the possibility that a job fails while writing a checkpoint, leaving:
- Truncated or corrupt files
- Inconsistent sets of files across ranks
Standard techniques:
- Atomicity via temporary files:
- Write to a temporary name (e.g., `chkpt_step_0100.tmp`).
- Only after successful completion, rename to `chkpt_step_0100`.
- Restart logic treats only non-`.tmp` files as candidates.
- Versioned or double-buffered checkpoints:
- Keep at least two generations: `chkpt_A` and `chkpt_B`.
- Always write a new generation fully before updating a small “pointer” file (e.g., `latest.chkpt`) that tells restarts which one is valid.
- Restart uses the checkpoint indicated by the pointer and can fall back to the previous one if needed.
- Internal consistency checks:
- Include magic numbers, format version, and checksums in checkpoint headers.
- Skip any candidate checkpoint that fails validation.
The restart mechanism should be conservative: it is better to lose a small amount of progress than to resume from corrupted state.
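A sketch of the temporary-file-plus-rename pattern for a single-file checkpoint (pickle here for brevity); `os.replace` is atomic on a local POSIX file system, but this behavior should be verified on the parallel file system of the target machine:

```python
import os
import pickle

def write_checkpoint_atomically(state, step, ckpt_dir):
    """Write to a temporary name, then rename: readers never see partial files."""
    final_path = os.path.join(ckpt_dir, f"chkpt_step_{step:06d}.pkl")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())     # push the data to storage before renaming
    # On the same file system, os.replace swaps the name atomically, so the
    # final filename only ever refers to a completely written checkpoint.
    os.replace(tmp_path, final_path)
```

Restart logic then only needs to skip `.tmp` files when enumerating candidates.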
Parallel and Distributed Restart
In parallel applications, restart mechanisms must be consistent across all ranks or threads.
MPI-based codes
Challenges and patterns:
- File layout choices:
- One global file vs one file per rank vs grouped files
- Restart mechanism must know how many ranks produced each checkpoint and how data is partitioned.
- Consistent metadata across ranks:
- Store the global number of processes and domain decomposition information in the checkpoint.
- On restart, validate:
- Whether the same number of ranks is used
- Whether a different decomposition is allowed and supported (e.g., via remapping tools)
- Collective coordination on restart:
- All ranks must:
- Agree on which checkpoint to load
- Participate in I/O calls in the same order and with compatible parameters
- A common pattern:
- Rank 0 reads the “latest checkpoint” metadata and broadcasts the chosen step and file names.
- Each rank then reads its own portion of the data (see the sketch after this list).
- Changing parallel layout between runs:
- More advanced restart mechanisms support changing:
- Number of MPI ranks
- Process grid topology
- This requires:
- Storing data in a format that is independent of the original decomposition (e.g., global indexing)
- Logic to re-distribute data over a new process layout at restart.
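A sketch of the rank-0-decides-and-broadcasts pattern, assuming mpi4py; the same structure applies with `MPI_Bcast` in C or Fortran, and the discovery helper is passed in rather than assumed:

```python
from mpi4py import MPI

def agree_on_checkpoint(ckpt_dir, find_checkpoint):
    """Rank 0 picks the restart checkpoint; every rank receives the same answer.

    `find_checkpoint` is a discovery helper like the one sketched earlier;
    only rank 0 touches the checkpoint metadata, avoiding redundant I/O.
    """
    comm = MPI.COMM_WORLD
    chosen = None
    if comm.Get_rank() == 0:
        chosen = find_checkpoint(ckpt_dir)   # path of the chosen checkpoint, or None
    # Broadcast the decision so all ranks open the same files; each rank
    # then reads only its own portion of the data.
    chosen = comm.bcast(chosen, root=0)
    return chosen
```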
Shared-memory and hybrid codes
For OpenMP or hybrid MPI+OpenMP codes:
- Avoid storing thread-local implementation details:
- Checkpoint global state and data structures; allow threads to reconstruct their per-thread state at runtime.
- Ensure that the restart process re-initializes:
- Thread pools
- Thread affinities or binding (if used)
- Any local caches or buffers used by threads
In all cases, the restart mechanism should treat parallel execution as an implementation detail; the checkpoint should represent a consistent global state.
Types of Restart Mechanisms
Cold vs warm vs hot restarts
- Cold restart
- Application starts from initial conditions only.
- No checkpoint is used.
- Typically used for brand new simulations.
- Warm restart
- Application restarts using a checkpoint taken at a natural boundary (e.g., end of timestep).
- Most common in batch job continuation and failure recovery.
- Requires complete state for accurate continuation.
- Hot restart
- Restart from a checkpoint taken during an inner iteration or sub-step.
- More complex; often requires saving more transient state.
- Used in specialized cases where granularity must be fine, e.g., very costly timesteps.
Most HPC applications implement warm restarts; hot restarts are justified only when restart intervals must be extremely short.
Manual vs automatic restart
- Manual restart
- User submits a new batch job specifying:
- Restart flag
- Path to checkpoints
- Updated wall-time or resource request
- Simple to implement and understand.
- Relies on user or workflow system to monitor failures and resubmit.
- Automatic restart within a workflow
- External tools (e.g., workflow managers, scripts) monitor job status.
- On failure or wall-time expiration:
- A new job is automatically submitted with restart parameters.
- Application itself may not know that a restart occurred; it just follows its restart logic.
- Self-managed continuation
- Application monitors its remaining wall time (e.g., via environment variables or scheduler APIs).
- Before the wall-time limit is reached, it:
- Writes a final checkpoint
- Exits cleanly with a code indicating “needs continuation”.
- A wrapper script or workflow reacts and resubmits the job.
The restart mechanism inside the application is the same, but integration with external tools determines how automatic the process is.
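A sketch of self-managed continuation in which the wall-time budget is passed in explicitly (e.g., from the job script) rather than queried from the scheduler; the exit code and the helper callables are illustrative:

```python
import sys
import time

NEEDS_CONTINUATION = 85   # arbitrary exit code a wrapper script can watch for

def run_with_time_budget(state, last_step, final_step, budget_s, margin_s,
                         advance_one_step, write_checkpoint):
    """Advance the simulation, but checkpoint and stop before the budget runs out.

    `budget_s` is the wall time granted to the job; `margin_s` leaves room to
    write the final checkpoint and shut down cleanly.
    """
    start = time.monotonic()
    for step in range(last_step + 1, final_step + 1):
        advance_one_step(state)
        if time.monotonic() - start > budget_s - margin_s:
            write_checkpoint(state, step)
            # Signal "not finished, please resubmit" to the wrapper/workflow.
            sys.exit(NEEDS_CONTINUATION)
    return state
```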
Robustness and Error Handling in Restarts
Handling incompatible or outdated checkpoints
Over time, application formats or data structures change. Restart logic should:
- Store a format version number and code version in each checkpoint
- On restart:
- Compare file version with the expected version
- If incompatible:
- Fail gracefully with a clear error message
- Optionally suggest a conversion tool if available
- Avoid silently misinterpreting old checkpoints
For long-lived codes, a small set of migration tools or compatibility layers may be necessary.
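A sketch of such a version check, assuming the format version is stored in the checkpoint header or sidecar metadata as in the earlier example:

```python
EXPECTED_FORMAT_VERSION = 3   # version written by the current code (illustrative)

def check_checkpoint_version(metadata):
    """Fail loudly when the checkpoint format does not match what the code expects.

    `metadata` is the dict read from the checkpoint header or sidecar file.
    """
    found = metadata.get("format_version")
    if found != EXPECTED_FORMAT_VERSION:
        raise RuntimeError(
            f"Checkpoint format version {found} is not supported by this build "
            f"(expected {EXPECTED_FORMAT_VERSION}). "
            "Convert the checkpoint with a migration tool or use a matching code version."
        )
```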
Dealing with partial outputs and side effects
Not all side effects are captured in checkpoints:
- Output files (e.g., diagnostic logs, derived fields)
- Post-processing status
- External databases or services
Restart mechanisms should:
- Ensure that core simulation state is consistent, even if some derived outputs are missing or duplicated.
- Record in the checkpoint what has been written so far (e.g., last output time for diagnostics).
- On restart:
- Optionally:
- Recompute missing outputs
- Or skip already-produced outputs based on recorded metadata
Application design should tolerate duplication of some outputs (e.g., repeated log messages) rather than risking loss of scientific state.
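A sketch of skipping already-produced diagnostics based on a `last_output_step` recorded in the checkpoint; the output routine is passed in as a callable and the bookkeeping scheme is illustrative:

```python
def maybe_write_diagnostics(step, output_interval, last_output_step, writer):
    """Call `writer(step)` only for outputs not already produced before a restart.

    `last_output_step` is read from the checkpoint; outputs at or before that
    step are assumed to exist already and are skipped rather than duplicated.
    """
    if step % output_interval == 0 and step > last_output_step:
        writer(step)
        return step          # record the new last completed output
    return last_output_step
```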
Integrating Restart with HPC Workflows
Restart-aware job scripts
Restart mechanisms interact with job scripts written for schedulers. Typical patterns:
- A single job script that:
- Checks for existing checkpoints.
- Chooses fresh run vs restart mode.
- Runs the executable with appropriate arguments.
- Looping or chained jobs:
- A job runs until close to wall-time, writes a checkpoint, and exits.
- The next job in the chain:
- Is submitted automatically
- Starts in restart mode from the last checkpoint.
The application’s restart mechanism must be predictable and well-documented so job scripts can reliably control behavior via flags or environment variables.
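A sketch of a small Python driver that a job script could invoke; the executable name, flags, checkpoint layout, and the `sbatch job.sh` resubmission are all placeholders for the real setup:

```python
import glob
import os
import subprocess
import sys

EXECUTABLE = "./simulation"           # hypothetical application binary
CKPT_DIR = "run_001/checkpoints"
NEEDS_CONTINUATION = 85               # exit code agreed with the application

def launch_once():
    # Decide fresh run vs restart by looking for existing checkpoints
    # (zero-padded step numbers make the lexicographic sort chronological).
    candidates = sorted(glob.glob(os.path.join(CKPT_DIR, "chkpt_step_*.h5")))
    cmd = [EXECUTABLE, "--final-step", "100000"]
    if candidates:
        cmd += ["--restart-from", candidates[-1]]
    returncode = subprocess.run(cmd).returncode
    if returncode == NEEDS_CONTINUATION:
        # Chain the next job; under Slurm this could be `sbatch job.sh`.
        subprocess.run(["sbatch", "job.sh"])
    return returncode

if __name__ == "__main__":
    sys.exit(launch_once())
```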
Provenance and auditability
For long projects, it is important to know:
- Which checkpoints were used to start which runs
- What parameters and code versions were in effect at each restart
Restart mechanisms can support this by:
- Logging restart events clearly to a text log:
- Checkpoint file name
- Step/time
- Code version
- Parameters
- Optionally writing a small “run history” or provenance file that links:
- Initial conditions
- All subsequent checkpoints and restarts
- Final outputs
This is essential for scientific reproducibility and debugging.
Testing and Validation of Restart Mechanisms
A restart mechanism is only trustworthy if it is tested systematically.
Basic equivalence tests
Common strategies:
- Short vs long run comparison
- Run A: Simulate from step 0 to $N$ without interruption.
- Run B: Simulate from step 0 to $K$, write a checkpoint, restart, and continue to $N$.
- Compare key metrics (fields, integrated quantities, diagnostics) at step $N$.
- Define acceptable tolerances, especially for floating-point operations.
- Multiple restart points
- Repeat the above with checkpoints at different steps $K$ (early, middle, late).
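A sketch of such an equivalence test, assuming the application exposes a callable that can run a given number of steps from either a fresh start or a checkpoint; the tolerance is illustrative and must be tuned to the code's numerics:

```python
import numpy as np

def check_restart_equivalence(run, N, K, tol=1e-12):
    """Compare an uninterrupted run with a checkpointed-and-restarted run.

    `run(start, n_steps)` is assumed to return (final_fields, checkpoint),
    where `checkpoint` can be passed back in as `start`; `None` means a
    fresh start. Tolerances must reflect the code's floating-point behavior.
    """
    reference, _ = run(None, N)        # run A: steps 0 -> N without interruption
    _, ckpt = run(None, K)             # run B, part 1: steps 0 -> K, keep the checkpoint
    restarted, _ = run(ckpt, N - K)    # run B, part 2: resume and continue K -> N
    assert np.allclose(reference, restarted, rtol=tol, atol=tol), \
        "restarted run diverged from the uninterrupted reference"
```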
Failure simulation
Simulate failures to ensure restart robustness:
- Intentionally kill the job at random times around checkpointing.
- Corrupt or truncate a checkpoint file (on a test system) to see if:
- The application detects it
- Falls back gracefully if designed to
- Fails with a clear and helpful message
Regression testing
Integrate restart tests into automated testing:
- Unit tests for:
- Metadata reading/writing
- Version compatibility checks
- Integration tests for:
- Full restart workflow with MPI (where relevant)
- Various configuration options controlling restart behavior
Keeping restart mechanisms under continuous test helps prevent subtle breakage as the code evolves.
Best Practices Summary
Effective restart mechanisms in HPC applications should:
- Clearly distinguish fresh runs from restarts using robust command-line or configuration options.
- Define consistent and well-chosen restart points aligned with algorithmic boundaries.
- Store enough state to resume correctly, but avoid unnecessary data when possible.
- Use naming conventions, metadata, and versioning to enable safe checkpoint discovery and compatibility checks.
- Handle partial or corrupt checkpoints conservatively, with mechanisms such as temporary files and multiple generations.
- Work correctly and consistently in parallel environments, including MPI and hybrid setups.
- Integrate smoothly with scheduler job scripts and higher-level workflows.
- Be thoroughly tested through equivalence runs, failure simulations, and regression tests.
A well-implemented restart mechanism transforms checkpoint data into a powerful tool for reliability, scalability, and reproducibility across complex HPC workflows.