Goals of checkpointing
Checkpointing means periodically saving enough state of a running application that, after a failure or interruption, you can restart from a recent point instead of from the very beginning. A checkpointing strategy decides:
- What to save
- When to save it
- How to save it
- Where to store it
A good checkpointing strategy balances:
- Overhead during normal execution (extra time and I/O)
- Expected wasted work after failures (how far back you have to restart)
- Storage usage and I/O pressure on the system
- Complexity (how hard it is to implement and maintain)
Types of checkpointing
Application-level vs. system-level
Application-level checkpointing
- The application explicitly writes its own restart files.
- Typically saves:
- Simulation state (fields, particles, matrices, time step, random seeds, etc.)
- Metadata needed to restart (iteration number, configuration parameters).
- Advantages:
- Can be compact and efficient, store only what is needed.
- Format can be stable and portable across systems.
- Often easier to evolve and version across code updates.
- Disadvantages:
- Developer effort required.
- Must be kept consistent with code changes.
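As a minimal sketch of what such an application-level checkpoint can look like, assuming a Python code whose primary state lives in a few NumPy arrays and using h5py for the restart file (the field names, attribute names, and file paths are illustrative, not prescribed by any standard):

```python
import os
import h5py

def write_checkpoint(path, step, t, dt, fields):
    """Write an application-level checkpoint.

    `fields` maps names to NumPy arrays (e.g. {"rho": rho, "u": u}).
    The file is written under a temporary name and then renamed, so an
    interrupted write never destroys the previous valid checkpoint.
    """
    tmp = path + ".tmp"
    with h5py.File(tmp, "w") as f:
        # Control state needed to resume the time loop.
        f.attrs["step"] = step
        f.attrs["time"] = t
        f.attrs["dt"] = dt
        # Primary simulation state; recomputable intermediates are omitted.
        for name, arr in fields.items():
            f.create_dataset(name, data=arr)
        # RNG state, configuration, and code version belong here as well.
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def read_checkpoint(path):
    with h5py.File(path, "r") as f:
        fields = {name: f[name][...] for name in f}
        return int(f.attrs["step"]), float(f.attrs["time"]), float(f.attrs["dt"]), fields
```

Writing to a temporary file and renaming it means a crash during the checkpoint itself leaves the last complete checkpoint intact.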
System-level (transparent) checkpointing
- External tools (e.g., CRIU, BLCR, DMTCP, batch-system integration) capture the process state without application modifications.
- Typically saves:
- Process memory image, registers, open file descriptors, sometimes network state.
- Advantages:
- No or minimal code changes.
- Can capture a wide range of applications.
- Disadvantages:
- Large checkpoint sizes (full memory images).
- Portability issues (OS/compiler differences, node counts).
- Harder to integrate with application-level understanding of “consistent state.”
In HPC, application-level checkpointing is common for large-scale numerical codes; system-level is often used in research settings or as a safety net.
Coordinated vs. uncoordinated (distributed-memory codes)
For MPI and other distributed-memory applications, each process must produce checkpoints that together represent a consistent global state (no messages “in flight” that might be lost).
Coordinated checkpointing
- All processes reach a checkpoint at a logically synchronized point.
- Often implemented as:
- A global barrier or communication step.
- Each process writes its piece of the state (possibly using collective I/O).
- Advantages:
- Simple reasoning: one global “snapshot.”
- Easier restart: all ranks restart from the same step.
- Disadvantages:
- Global synchronization overhead.
- All processes perform I/O around the same time (I/O burst).
Uncoordinated checkpointing
- Each process independently decides when to checkpoint.
- Risks inconsistent states where some messages are recorded at sender but not at receiver (or vice versa).
- To make this workable, techniques like message logging must be used so that communications can be replayed.
- Typically more complex and less common in entry-level HPC practice.
Practical takeaway: most production HPC codes use coordinated, application-level checkpoints at well-defined iteration or time-step boundaries.
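A minimal sketch of this pattern with mpi4py, assuming each rank owns a local NumPy array and writes one file per rank at the end of a time step (the file naming and the checkpoint interval are illustrative):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def coordinated_checkpoint(step, local_state):
    """All ranks call this at the same, well-defined point in the loop."""
    # Every rank has finished its updates and halo exchanges before
    # anything is written: a logically synchronized global snapshot.
    comm.Barrier()
    # Each rank writes only its local piece of the global state.
    np.save(f"chkpt_step_{step:06d}_rank_{rank:04d}.npy", local_state)
    # A second barrier ensures the snapshot is complete on all ranks
    # before anyone continues (or deletes older checkpoints).
    comm.Barrier()

# Illustrative time loop: checkpoint every 100 steps.
local_state = np.zeros(1000)
for step in range(1, 1001):
    # ... advance local_state, exchange halos ...
    if step % 100 == 0:
        coordinated_checkpoint(step, local_state)
```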
Full vs. incremental vs. differential
Full checkpoints
- Save all necessary state every time.
- Simple to implement and reason about.
- Higher I/O volume and storage cost.
Incremental checkpoints
- Save only the data that changed since the last checkpoint.
- Requires tracking changes (e.g., by pages, blocks, or logical data structures).
- Benefits:
- Reduced I/O for slowly changing state.
- Costs:
- More complex logic for creating and reading checkpoints.
- Restart may require reading multiple checkpoint files (base + increments).
Differential checkpoints
- Save differences relative to a fixed base snapshot rather than relative to the immediate previous one.
- Reduces restart complexity (base + 1 differential) but may result in larger differentials over time.
For many codes, the right first step is simple full checkpoints; move to incremental checkpoints only if I/O overhead becomes a bottleneck.
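As a sketch of the incremental idea, assuming the state is a flat NumPy array split into fixed-size blocks whose hashes decide what has changed (block size, hash choice, and file layout are all illustrative; a full base checkpoint is still needed before the first increment):

```python
import hashlib
import numpy as np

BLOCK = 4096  # elements per block; tune to the data layout

def block_hashes(arr):
    """Hash each block of a flat array to detect changes cheaply."""
    flat = arr.ravel()
    return [hashlib.sha1(flat[i:i + BLOCK].tobytes()).digest()
            for i in range(0, flat.size, BLOCK)]

def incremental_checkpoint(arr, prev_hashes, path):
    """Write only the blocks whose hash changed since the last checkpoint.

    Returns the new hashes. Restart must replay the full base checkpoint
    plus every increment written since: the price of the reduced I/O.
    """
    new_hashes = block_hashes(arr)
    flat = arr.ravel()
    changed = {}
    for b, (old, new) in enumerate(zip(prev_hashes, new_hashes)):
        if old != new:
            changed[f"block_{b:06d}"] = flat[b * BLOCK:(b + 1) * BLOCK].copy()
    np.savez(path, **changed)  # one entry per changed block
    return new_hashes
```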
Synchronous vs. asynchronous checkpointing
Synchronous checkpointing
- The application waits until the checkpoint is fully written before continuing computations.
- Simple and robust but may cause noticeable pauses, especially with large data.
Asynchronous checkpointing
- Application initiates a checkpoint and continues computing while the data is being written in the background.
- Implemented via:
- Additional I/O threads or helper processes.
- Double-buffering: writing one buffer while computing into another.
- Reduces visible pause but adds implementation complexity and more elaborate error handling.
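A sketch of the double-buffering idea, assuming a Python code where a helper thread writes a frozen copy of the state while the main loop keeps computing (production codes often use dedicated I/O processes or libraries instead, and error handling is omitted here):

```python
import threading
import numpy as np

def _writer(snapshot, path):
    """Background writer: runs while the main loop keeps computing."""
    np.save(path, snapshot)

def async_checkpoint(state, step, pending):
    """Start an asynchronous checkpoint of `state`.

    The copy acts as the second buffer: the main loop may keep modifying
    `state` while the copy is written. `pending` is the previous writer
    thread, so at most one checkpoint is in flight at a time.
    """
    if pending is not None:
        pending.join()       # make sure the previous write has finished
    snapshot = state.copy()  # double-buffer: freeze the current state
    t = threading.Thread(target=_writer,
                         args=(snapshot, f"chkpt_step_{step:06d}.npy"))
    t.start()
    return t

# Illustrative time loop:
state = np.zeros(10_000_000)
pending = None
for step in range(1, 501):
    # ... advance state ...
    if step % 100 == 0:
        pending = async_checkpoint(state, step, pending)
if pending is not None:
    pending.join()           # wait for the final checkpoint on exit
```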
When to checkpoint: frequency and policies
Fixed-interval strategies
Most beginners start with a simple rule such as:
- “Write a checkpoint every N iterations”
- “Write a checkpoint every T minutes”
This is easy to configure and reason about, but may not be optimal for:
- Very short jobs (few checkpoints are enough)
- Very long jobs on unreliable systems (more frequent checkpoints needed)
- Phases of the simulation with variable cost per time step
Analytical trade-off: checkpoint interval vs. failure rate
There is a well-known trade-off between:
- Checkpoint overhead per interval (time spent writing)
- Expected lost work when a failure occurs
A classical approximation for the optimal time between checkpoints $T_{\text{opt}}$ is based on:
- $C$: time to write a checkpoint
- $MTBF$: mean time between failures on the system (or for the job)
A commonly used formula (Young/Daly approximation) is:
$$
T_{\text{opt}} \approx \sqrt{2 \, C \, MTBF}
$$
Interpretation:
- If checkpoints are expensive (large $C$), $T_{\text{opt}}$ grows.
- If failures are frequent (short $MTBF$), $T_{\text{opt}}$ shrinks.
In practice:
- $MTBF$ is often not precisely known.
- Many centers and applications adopt heuristic intervals (e.g., every 30–60 minutes) and tune based on experience and job duration.
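As a worked example with illustrative numbers, assume $C = 5$ minutes and $MTBF = 24$ hours $= 1440$ minutes:
$$
T_{\text{opt}} \approx \sqrt{2 \cdot 5 \cdot 1440} \;\text{min} = \sqrt{14400} \;\text{min} = 120 \;\text{min}
$$
i.e., a checkpoint roughly every two hours, with a steady-state overhead of about $5/120 \approx 4\%$ of runtime.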
Adaptive strategies
More advanced strategies adapt checkpoint frequency based on:
- Remaining wall time of the job allocation
- Observed I/O performance
- Progress rate (e.g., slower phases may need less frequent checkpoints)
- External signals (e.g., scheduler telling the job it will be preempted)
Examples:
- Increase checkpoint frequency near the end of a time allocation to ensure there is a recent restart point.
- Reduce checkpoints during intense I/O phases to avoid congestion.
- Trigger a checkpoint on receiving a job preemption or shutdown signal, if the system supports this.
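A sketch of such a policy, assuming the job is told its wall-time limit and an estimated checkpoint cost by the caller (the interval and the three-checkpoint safety margin are illustrative):

```python
import time

class CheckpointPolicy:
    """Checkpoint at a regular interval, but more aggressively near the
    end of the allocation so a recent restart point always exists."""

    def __init__(self, interval_s, walltime_s, ckpt_cost_s):
        self.interval_s = interval_s    # normal checkpoint interval
        self.walltime_s = walltime_s    # total allocation length
        self.ckpt_cost_s = ckpt_cost_s  # estimated time to write one checkpoint
        self.start = time.monotonic()
        self.last_ckpt = self.start

    def should_checkpoint(self):
        now = time.monotonic()
        remaining = self.walltime_s - (now - self.start)
        since_last = now - self.last_ckpt
        # Entering the final window of the allocation: make sure a recent
        # checkpoint exists (keep ~3 checkpoint durations of margin).
        if remaining < 3 * self.ckpt_cost_s and since_last > self.ckpt_cost_s:
            return True
        # Otherwise, a plain fixed-interval policy.
        return since_last >= self.interval_s

    def mark_checkpoint(self):
        self.last_ckpt = time.monotonic()
```

The main loop calls `should_checkpoint()` once per iteration and `mark_checkpoint()` after each successful write.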
Phase-aware checkpointing
Some applications have distinct phases:
- Initialization / setup
- Main time-stepping loop
- Post-processing or analysis
Strategies:
- Skip or minimize checkpoints during short initialization phases.
- Regular checkpoints during the main loop.
- Final checkpoint that includes any post-processed data needed to restart without repeating expensive preprocessing.
What to checkpoint: selecting state
The key is to checkpoint only what is necessary and sufficient to reconstruct the computation.
Common components:
- Simulation state:
- Primary fields or variables (e.g., densities, velocities, pressures).
- Particle positions/velocities.
- State vectors / solution arrays for iterative methods.
- Control state:
- Current iteration or time step.
- Physical time, step size, convergence flags.
- Random number generator state:
- Seeds or full RNG state to ensure reproducibility if stochastic elements are involved.
- Configuration and parameters:
- Input parameters used in this run (possibly embedded in the checkpoint).
- Code version, mesh description, boundary conditions.
Often not needed (or better reconstructed):
- Recomputable intermediates (e.g., temporary buffers, local scratch arrays).
- Derived quantities that can be recomputed at restart.
- Logging/debug information (can be separate from restart state).
A useful practice is to document in code what is considered checkpoint state and maintain this list as the code evolves.
For distributed-memory codes:
- Each rank typically stores only its local portion of the global data (e.g., domain decomposition).
- Metadata must describe:
- Decomposition layout.
- Mapping from ranks to data regions.
- Any ghost cell or halo structure.
Where to store checkpoints
Local vs. shared storage
Local storage (node-local disks, SSDs, NVMe, RAM disks)
- Pros:
- Lower latency, higher bandwidth.
- Reduces load on the shared parallel filesystem.
- Cons:
- Not persistent if the node fails or is reallocated.
- Harder to use for long-term restart or if jobs migrate.
Shared/parallel filesystem (e.g., Lustre, GPFS)
- Pros:
- Accessible from all nodes; survives individual node failures.
- Suitable for long-running jobs and inter-job restarts.
- Cons:
- Higher contention; I/O storms when many jobs checkpoint simultaneously.
- Higher latency; more sensitive to poor I/O patterns.
A common hybrid strategy:
- Checkpoint frequently to local storage.
- Periodically copy or stage “major” checkpoints to the parallel filesystem.
- Use this as a two-level scheme: fast local safety + durable global backup.
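A sketch of the staging step, assuming node-local scratch under /tmp and a project directory on the parallel filesystem (both paths, the `write_fn` callback, and the staging period are placeholders):

```python
import shutil
from pathlib import Path

LOCAL_DIR = Path("/tmp/myjob_ckpt")        # node-local SSD / RAM disk (placeholder)
GLOBAL_DIR = Path("/lustre/project/ckpt")  # parallel filesystem (placeholder)

def checkpoint_two_level(step, write_fn, stage_every=5):
    """Write every checkpoint locally; copy every `stage_every`-th one to
    the parallel filesystem as the durable "major" checkpoint."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    local_path = LOCAL_DIR / f"chkpt_step_{step:06d}.h5"
    write_fn(local_path)  # fast local write, supplied by the application
    if step % stage_every == 0:
        GLOBAL_DIR.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_path, GLOBAL_DIR / local_path.name)  # durable copy
```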
Redundancy and replication
To protect against storage failures:
- Store checkpoints in multiple locations, e.g., two directories on different filesystems.
- For critical runs, keep older checkpoint generations, not just the latest:
- Protects against corruption or bugs introduced in new checkpoint formats.
- Allows “rolling back” to a known good state.
Trade-offs:
- Increased storage consumption.
- More I/O time when duplicating checkpoints.
Policy examples:
- Keep the last N checkpoints (e.g., N=3) and delete older ones.
- For each “major” checkpoint, create a second copy on a different filesystem or tape/archive.
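A sketch of the keep-last-N policy, assuming zero-padded step numbers in the file names so that lexicographic order matches chronological order (the pattern and N are illustrative):

```python
import glob
import os

def prune_checkpoints(pattern="chkpt_step_*.h5", keep=3):
    """Delete all but the `keep` most recent checkpoints matching `pattern`."""
    files = sorted(glob.glob(pattern))
    for old in files[:-keep]:
        os.remove(old)
```

Pruning should run only after the newest checkpoint has been written completely and validated, so a failed write never leaves the run with no usable checkpoint.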
Structuring checkpoint files
Per-rank vs. collective checkpoint files
Per-rank (one file per MPI task)
- Each rank writes its own checkpoint file, e.g., chkpt_rank0001.h5.
- Pros:
- Simple to implement; no inter-rank I/O coordination.
- Natural mapping between rank and file.
- Cons:
- Large number of files for big jobs (scalability, metadata overhead).
- Harder to manage and move.
Collective (few or one global file)
- Ranks use parallel I/O (e.g., MPI-IO, HDF5, NetCDF) to write into a shared file or a small set of files.
- Pros:
- Fewer files; better suited to large-scale jobs.
- Often better performance on parallel filesystems when implemented properly.
- Cons:
- More complex code.
- Failure during I/O can corrupt a large shared file.
A common compromise:
- A small number of files, e.g., one per node or one per MPI communicator group.
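A sketch of a single shared checkpoint file written collectively, assuming h5py built against parallel HDF5 (mpi4py plus the `mpio` driver) and a simple 1D block decomposition; the dataset name and sizes are illustrative:

```python
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

local_n = 1000                               # elements owned by this rank
local_u = np.full(local_n, rank, dtype="f8")
offset = rank * local_n                      # 1D block decomposition

# Every rank opens the same file; h5py/HDF5 use MPI-IO underneath.
with h5py.File("chkpt_global.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("u", shape=(nprocs * local_n,), dtype="f8")
    dset[offset:offset + local_n] = local_u  # each rank writes its own slab
```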
Versioning and metadata
Include metadata in or alongside checkpoints:
- Checkpoint version number (format version).
- Code git hash or build identifier.
- Compiler and library versions if relevant.
- Date/time and hostname.
- Description of domain decomposition and layout.
Benefits:
- Helps detect incompatibilities when restarting with a newer version of the code.
- Makes debugging restart issues far easier.
Naming schemes:
- chkpt_step_000100_rank_0000.h5
- chkpt_t_0010.0_global.h5
- Include time step or physical time, and possibly job ID.
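A sketch of attaching such metadata, assuming HDF5 attributes on the checkpoint file and a helper that builds names in the style above (the attribute names and the `git_hash` argument are illustrative; in practice the hash is usually baked in at build time):

```python
import datetime
import socket
import h5py

def add_checkpoint_metadata(path, fmt_version, git_hash, step, phys_time):
    """Attach restart metadata to an existing checkpoint file."""
    with h5py.File(path, "a") as f:
        f.attrs["format_version"] = fmt_version
        f.attrs["git_hash"] = git_hash
        f.attrs["written_at"] = datetime.datetime.now().isoformat()
        f.attrs["hostname"] = socket.gethostname()
        f.attrs["step"] = step
        f.attrs["physical_time"] = phys_time

def checkpoint_name(step, rank):
    # Zero-padded step and rank keep files sortable, as in the schemes above.
    return f"chkpt_step_{step:06d}_rank_{rank:04d}.h5"
```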
Consistency and correctness
Achieving consistent global state
For parallel codes, a checkpoint should represent a state that could have arisen in a normal execution without checkpointing.
Key points:
- Write at synchronization points:
- After a global communication (e.g., collective operation).
- At the end of a time step where all processes have:
- Updated their local data.
- Exchanged halo or ghost cells.
- Avoid checkpointing:
- Mid-way through long communication phases.
- When buffers contain messages that have not yet been processed.
Simple practice:
- Place checkpoint calls at the end of the main iteration loop, after communications and updates.
- Use an MPI barrier or equivalent if needed to ensure all ranks reach the checkpoint stage together.
Restart validation
To ensure the strategy is correct:
- Restart tests:
- Run a simulation for some time, checkpoint, stop.
- Restart from that checkpoint and continue.
- Compare results against an uninterrupted reference run (bitwise or within acceptable numerical tolerance).
- Partial restart:
- Test restarting with a different number of processes if your design is supposed to support this.
- Verify that domain redistribution and mapping are correctly handled.
Debugging tips:
- Check that all critical state (time, iteration, seeds, parameters) is restored, not only the main fields.
- Ensure that I/O buffering is flushed and files are closed before assuming a checkpoint is valid.
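A sketch of such a restart test, assuming a stand-in `advance` function for the real time loop and comparison within a numerical tolerance (the update rule, step counts, and tolerances are illustrative):

```python
import numpy as np

def advance(state, nsteps):
    """Stand-in for the real time-stepping loop."""
    for _ in range(nsteps):
        state = 0.5 * state + 1.0  # illustrative update
    return state

def test_restart_matches_reference():
    state0 = np.linspace(0.0, 1.0, 1000)

    # Reference: 200 steps without interruption.
    reference = advance(state0.copy(), 200)

    # Interrupted run: 120 steps, checkpoint, restart, 80 more steps.
    mid = advance(state0.copy(), 120)
    np.save("chkpt_test.npy", mid)        # checkpoint
    restored = np.load("chkpt_test.npy")  # restart
    resumed = advance(restored, 80)

    # Bitwise equality can be too strict if reductions are reordered;
    # compare within an acceptable numerical tolerance instead.
    assert np.allclose(resumed, reference, rtol=1e-12, atol=0.0)
```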
Performance and scalability considerations
Reducing I/O overhead
- Compress checkpoint data (if CPU is cheaper than I/O time):
- Lossless compression (e.g., gzip, zstd, HDF5 filters).
- In some scientific contexts, lossy compression may be acceptable (e.g., floating-point compressors) but must be carefully evaluated.
- Minimize data volume:
- Remove redundant or recomputable data from checkpoints.
- Use more compact data types where appropriate.
- Align with filesystem characteristics:
- Use large, contiguous writes.
- Avoid many tiny writes and random access patterns.
- Batch metadata updates; reduce number of checkpoint files where possible.
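As a small sketch of lossless compression through an HDF5 filter, assuming h5py (the chunk shape and gzip level are illustrative and should be tuned against measured I/O time):

```python
import h5py
import numpy as np

u = np.random.default_rng(0).random((2048, 2048))

with h5py.File("chkpt_compressed.h5", "w") as f:
    f.create_dataset("u", data=u,
                     chunks=(256, 2048),   # large, contiguous chunks
                     compression="gzip",   # lossless; "lzf" is a faster, lighter option
                     compression_opts=4)   # gzip level: trades CPU for size
```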
Staggered checkpointing
To avoid I/O storms when many jobs checkpoint simultaneously:
- Randomize or stagger checkpoint times across jobs:
- Add a small random offset to the checkpoint schedule.
- Within a single application:
- If using multiple checkpoints (e.g., multi-level strategies), offset them across process groups.
Some centers provide guidance on recommended checkpoint intervals and times based on overall system load.
Multi-level checkpointing
Multi-level strategies use several “tiers” of reliability and cost:
- Level 1: Very frequent, cheap, local checkpoints (e.g., in-node memory-copy or SSD).
- Protects against soft errors or application crashes that don’t involve node failure.
- Level 2: Less frequent checkpoints to shared parallel filesystem.
- Protects against node failures.
- Level 3: Rare checkpoints to archival or campaign storage (e.g., tape or an archive filesystem).
- Protects long-running campaigns or highly valuable simulations.
This hierarchical approach aims to match checkpoint cost to expected failure modes and frequency.
Integration with job scheduling and workflow
Checkpointing and wall-time limits
In batch systems with strict wall-time limits:
- Plan checkpoint intervals so that:
- You can safely write a checkpoint before wall time expires.
- You do not lose too much computation if the job ends early (e.g., due to preemption).
- Common strategy:
- Estimate time needed for a checkpoint (possibly worst-case).
- Ensure that your final checkpoint is initiated well before job end (e.g., 2–3 checkpoint durations before wall-time limit).
Some schedulers may:
- Provide signals (e.g., SIGTERM) some minutes before killing the job.
- Allow jobs to catch these signals and trigger a last-minute checkpoint.
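A sketch of catching such a signal, assuming Python's standard signal module; the handler only sets a flag, and the actual checkpoint is written in the main loop where the state is known to be consistent (the loop shown in comments uses hypothetical helper names):

```python
import signal

checkpoint_requested = False

def _handle_term(signum, frame):
    """Only set a flag here; do the I/O in the main loop, where the
    application state is consistent."""
    global checkpoint_requested
    checkpoint_requested = True

# Many batch systems send SIGTERM (or a configurable signal such as
# SIGUSR1) some minutes before killing the job.
signal.signal(signal.SIGTERM, _handle_term)
signal.signal(signal.SIGUSR1, _handle_term)

# Illustrative main loop (advance_one_step and write_checkpoint are
# hypothetical application functions):
# for step in range(1, nsteps + 1):
#     advance_one_step()
#     if checkpoint_requested or step % interval == 0:
#         write_checkpoint(step)
#         if checkpoint_requested:
#             break  # exit cleanly before the hard kill
```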
Checkpointing in multi-stage workflows
In workflows involving multiple jobs or stages:
- Use checkpointing to:
- Bridge between stages (e.g., restart from a saved state with different resolution or parameters).
- Facilitate branching (e.g., restart the same state with different physics options).
- Document:
- Which checkpoint belongs to which stage.
- How to move from checkpoint A to stage B.
Workflow managers and scripts can:
- Automatically manage checkpoint naming.
- Copy or archive older checkpoints.
- Automate validation of successful restarts.
Practical guidelines and best practices
- Start with a simple coordinated, full application-level checkpoint at clear iteration boundaries.
- Choose an initial checkpoint interval based on:
- Job length (target multiple checkpoints over the full run).
- Reasonable overhead (e.g., checkpointing overhead < 5–10% of runtime).
- Record checkpoint metadata (version, parameters, build info).
- Design for incremental evolution:
- Plan how checkpoint formats can evolve while remaining backward compatible or at least detect incompatibility cleanly.
- Regularly test restart capability as part of your development and QA process.
- Monitor:
- Checkpoint sizes.
- Checkpoint duration.
- Impact on overall I/O performance (both for your job and at system scale).
- Coordinate with the HPC support team:
- Learn filesystem-specific recommendations.
- Understand any site policies about checkpoint intervals and data retention.
A thoughtful checkpointing strategy turns failures and interruptions from catastrophic losses into manageable inconveniences, and is essential for reliable large-scale HPC runs.