Goals of checkpointing
Checkpointing means periodically saving enough state of a running application that, after a failure or interruption, you can restart from a recent point instead of from the very beginning. A checkpointing strategy decides:
- What to save
- When to save it
- How to save it
- Where to store it
A good checkpointing strategy balances:
- Overhead during normal execution (extra time and I/O)
- Expected wasted work after failures (how far back you have to restart)
- Storage usage and I/O pressure on the system
- Complexity (how hard it is to implement and maintain)
Types of checkpointing
Application-level vs. system-level
Application-level checkpointing
- The application explicitly writes its own restart files.
- Typically saves:
- Simulation state (fields, particles, matrices, time step, random seeds, etc.)
- Metadata needed to restart (iteration number, configuration parameters).
- Advantages:
- Can be compact and efficient, store only what is needed.
- Format can be stable and portable across systems.
- Often easier to evolve and version across code updates.
- Disadvantages:
- Developer effort required.
- Must be kept consistent with code changes.
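As a minimal sketch of what such an application-level checkpoint can look like, assuming a Python code whose primary state lives in a few NumPy arrays and using h5py for the restart file (the field names, attribute names, and file paths are illustrative, not prescribed by any standard):

```python
import os
import h5py

def write_checkpoint(path, step, t, dt, fields):
    """Write an application-level checkpoint.

    `fields` maps names to NumPy arrays (e.g. {"rho": rho, "u": u}).
    The file is written under a temporary name and then renamed, so an
    interrupted write never destroys the previous valid checkpoint.
    """
    tmp = path + ".tmp"
    with h5py.File(tmp, "w") as f:
        # Control state needed to resume the time loop.
        f.attrs["step"] = step
        f.attrs["time"] = t
        f.attrs["dt"] = dt
        # Primary simulation state; recomputable intermediates are omitted.
        for name, arr in fields.items():
            f.create_dataset(name, data=arr)
        # RNG state, configuration, and code version belong here as well.
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

def read_checkpoint(path):
    with h5py.File(path, "r") as f:
        fields = {name: f[name][...] for name in f}
        return int(f.attrs["step"]), float(f.attrs["time"]), float(f.attrs["dt"]), fields
```

Writing to a temporary file and renaming it means a crash during the checkpoint itself leaves the last complete checkpoint intact.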
System-level (transparent) checkpointing
- External tools (e.g., CRIU, BLCR, DMTCP, batch-system integration) capture the process state without application modifications.
- Typically saves:
- Process memory image, registers, open file descriptors, sometimes network state.
- Advantages:
- No or minimal code changes.
- Can capture a wide range of applications.
- Disadvantages:
- Large checkpoint sizes (full memory images).
- Portability issues (OS/compiler differences, node counts).
- Harder to integrate with application-level understanding of “consistent state.”
In HPC, application-level checkpointing is common for large-scale numerical codes; system-level is often used in research settings or as a safety net.
Coordinated vs. uncoordinated (distributed-memory codes)
For MPI and other distributed-memory applications, each process must produce checkpoints that together represent a consistent global state (no messages “in flight” that might be lost).
Coordinated checkpointing
- All processes reach a checkpoint at a logically synchronized point.
- Often implemented as:
- A global barrier or communication step.
- Each process writes its piece of the state (possibly using collective I/O).
- Advantages:
- Simple reasoning: one global “snapshot.”
- Easier restart: all ranks restart from the same step.
- Disadvantages:
- Global synchronization overhead.
- All processes perform I/O around the same time (I/O burst).
Uncoordinated checkpointing
- Each process independently decides when to checkpoint.
- Risks inconsistent states where some messages are recorded at sender but not at receiver (or vice versa).
- To make this workable, techniques like message logging must be used so that communications can be replayed.
- Typically more complex and less common in entry-level HPC practice.
Practical takeaway: most production HPC codes use coordinated, application-level checkpoints at well-defined iteration or time-step boundaries.
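A minimal sketch of this pattern with mpi4py, assuming each rank owns a local NumPy array and writes one file per rank at the end of a time step (the file naming and the checkpoint interval are illustrative):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def coordinated_checkpoint(step, local_state):
    """All ranks call this at the same, well-defined point in the loop."""
    # Every rank has finished its updates and halo exchanges before
    # anything is written: a logically synchronized global snapshot.
    comm.Barrier()
    # Each rank writes only its local piece of the global state.
    np.save(f"chkpt_step_{step:06d}_rank_{rank:04d}.npy", local_state)
    # A second barrier ensures the snapshot is complete on all ranks
    # before anyone continues (or deletes older checkpoints).
    comm.Barrier()

# Illustrative time loop: checkpoint every 100 steps.
local_state = np.zeros(1000)
for step in range(1, 1001):
    # ... advance local_state, exchange halos ...
    if step % 100 == 0:
        coordinated_checkpoint(step, local_state)
```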
Full vs. incremental vs. differential
Full checkpoints
- Save all necessary state every time.
- Simple to implement and reason about.
- Higher I/O volume and storage cost.
Incremental checkpoints
- Save only the data that changed since the last checkpoint.
- Requires tracking changes (e.g., by pages, blocks, or logical data structures).
- Benefits:
- Reduced I/O for slowly changing state.
- Costs:
- More complex logic for creating and reading checkpoints.
- Restart may require reading multiple checkpoint files (base + increments).
Differential checkpoints
- Save differences relative to a fixed base snapshot rather than relative to the immediate previous one.
- Reduces restart complexity (base + 1 differential) but may result in larger differentials over time.
For many codes, the right first step is simple full checkpoints; move to incremental checkpoints only if I/O overhead becomes a bottleneck.
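As a sketch of the incremental idea, assuming the state is a flat NumPy array split into fixed-size blocks whose hashes decide what has changed (block size, hash choice, and file layout are all illustrative; a full base checkpoint is still needed before the first increment):

```python
import hashlib
import numpy as np

BLOCK = 4096  # elements per block; tune to the data layout

def block_hashes(arr):
    """Hash each block of a flat array to detect changes cheaply."""
    flat = arr.ravel()
    return [hashlib.sha1(flat[i:i + BLOCK].tobytes()).digest()
            for i in range(0, flat.size, BLOCK)]

def incremental_checkpoint(arr, prev_hashes, path):
    """Write only the blocks whose hash changed since the last checkpoint.

    Returns the new hashes. Restart must replay the full base checkpoint
    plus every increment written since: the price of the reduced I/O.
    """
    new_hashes = block_hashes(arr)
    flat = arr.ravel()
    changed = {}
    for b, (old, new) in enumerate(zip(prev_hashes, new_hashes)):
        if old != new:
            changed[f"block_{b:06d}"] = flat[b * BLOCK:(b + 1) * BLOCK].copy()
    np.savez(path, **changed)  # one entry per changed block
    return new_hashes
```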
Synchronous vs. asynchronous checkpointing
Synchronous checkpointing
- The application waits until the checkpoint is fully written before continuing computations.
- Simple and robust but may cause noticeable pauses, especially with large data.
Asynchronous checkpointing
- Application initiates a checkpoint and continues computing while the data is being written in the background.
- Implemented via:
- Additional I/O threads or helper processes.
- Double-buffering: writing one buffer while computing into another.
- Reduces visible pause but adds implementation complexity and more elaborate error handling.
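A sketch of the double-buffering idea, assuming a Python code where a helper thread writes a frozen copy of the state while the main loop keeps computing (production codes often use dedicated I/O processes or libraries instead, and error handling is omitted here):

```python
import threading
import numpy as np

def _writer(snapshot, path):
    """Background writer: runs while the main loop keeps computing."""
    np.save(path, snapshot)

def async_checkpoint(state, step, pending):
    """Start an asynchronous checkpoint of `state`.

    The copy acts as the second buffer: the main loop may keep modifying
    `state` while the copy is written. `pending` is the previous writer
    thread, so at most one checkpoint is in flight at a time.
    """
    if pending is not None:
        pending.join()       # make sure the previous write has finished
    snapshot = state.copy()  # double-buffer: freeze the current state
    t = threading.Thread(target=_writer,
                         args=(snapshot, f"chkpt_step_{step:06d}.npy"))
    t.start()
    return t

# Illustrative time loop:
state = np.zeros(10_000_000)
pending = None
for step in range(1, 501):
    # ... advance state ...
    if step % 100 == 0:
        pending = async_checkpoint(state, step, pending)
if pending is not None:
    pending.join()           # wait for the final checkpoint on exit
```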
When to checkpoint: frequency and policies
Fixed-interval strategies
Most beginners start with a simple rule such as:
- “Write a checkpoint every N iterations”
- “Write a checkpoint every T minutes”
This is easy to configure and reason about, but may not be optimal for:
- Very short jobs (few checkpoints are enough)
- Very long jobs on unreliable systems (more frequent checkpoints needed)
- Phases of the simulation with variable cost per time step
Analytical trade-off: checkpoint interval vs. failure rate
There is a well-known trade-off between:
- Checkpoint overhead per interval (time spent writing)
- Expected lost work when a failure occurs
A classical approximation for the optimal time between checkpoints $T_{\text{opt}}$ is based on:
- $C$: time to write a checkpoint
- $MTBF$: mean time between failures on the system (or for the job)
A commonly used formula (Young/Daly approximation) is:
$$
T_{\text{opt}} \approx \sqrt{2 \, C \, MTBF}
$$
Interpretation:
- If checkpoints are expensive (large $C$), $T_{\text{opt}}$ grows.
- If failures are frequent (short $MTBF$), $T_{\text{opt}}$ shrinks.
In practice:
- $MTBF$ is often not precisely known.
- Many centers and applications adopt heuristic intervals (e.g., every 30–60 minutes) and tune based on experience and job duration.
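As a worked example with illustrative numbers, assume $C = 5$ minutes and $MTBF = 24$ hours $= 1440$ minutes:
$$
T_{\text{opt}} \approx \sqrt{2 \cdot 5 \cdot 1440} \;\text{min} = \sqrt{14400} \;\text{min} = 120 \;\text{min}
$$
i.e., a checkpoint roughly every two hours, with a steady-state overhead of about $5/120 \approx 4\%$ of runtime.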
Adaptive strategies
More advanced strategies adapt checkpoint frequency based on:
- Remaining wall time of the job allocation
- Observed I/O performance
- Progress rate (e.g., slower phases may need less frequent checkpoints)
- External signals (e.g., scheduler telling the job it will be preempted)
Examples:
- Increase checkpoint frequency near the end of a time allocation to ensure there is a recent restart point.
- Reduce checkpoints during intense I/O phases to avoid congestion.
- Trigger a checkpoint on receiving a job preemption or shutdown signal, if the system supports this.
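A sketch of such a policy, assuming the job is told its wall-time limit and an estimated checkpoint cost by the caller (the interval and the three-checkpoint safety margin are illustrative):

```python
import time

class CheckpointPolicy:
    """Checkpoint at a regular interval, but more aggressively near the
    end of the allocation so a recent restart point always exists."""

    def __init__(self, interval_s, walltime_s, ckpt_cost_s):
        self.interval_s = interval_s    # normal checkpoint interval
        self.walltime_s = walltime_s    # total allocation length
        self.ckpt_cost_s = ckpt_cost_s  # estimated time to write one checkpoint
        self.start = time.monotonic()
        self.last_ckpt = self.start

    def should_checkpoint(self):
        now = time.monotonic()
        remaining = self.walltime_s - (now - self.start)
        since_last = now - self.last_ckpt
        # Entering the final window of the allocation: make sure a recent
        # checkpoint exists (keep ~3 checkpoint durations of margin).
        if remaining < 3 * self.ckpt_cost_s and since_last > self.ckpt_cost_s:
            return True
        # Otherwise, a plain fixed-interval policy.
        return since_last >= self.interval_s

    def mark_checkpoint(self):
        self.last_ckpt = time.monotonic()
```

The main loop calls `should_checkpoint()` once per iteration and `mark_checkpoint()` after each successful write.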
Phase-aware checkpointing
Some applications have distinct phases:
- Initialization / setup
- Main time-stepping loop
- Post-processing or analysis
Strategies:
- Skip or minimize checkpoints during short initialization phases.
- Regular checkpoints during the main loop.
- Final checkpoint that includes any post-processed data needed to restart without repeating expensive preprocessing.
What to checkpoint: selecting state
The key is to checkpoint only what is necessary and sufficient to reconstruct the computation.
Common components:
- Simulation state:
- Primary fields or variables (e.g., densities, velocities, pressures).
- Particle positions/velocities.
- State vectors / solution arrays for iterative methods.
- Control state:
- Current iteration or time step.
- Physical time, step size, convergence flags.
- Random number generator state:
- Seeds or full RNG state to ensure reproducibility if stochastic elements are involved.
- Configuration and parameters:
- Input parameters used in this run (possibly embedded in the checkpoint).
- Code version, mesh description, boundary conditions.
Often not needed (or better reconstructed):
- Recomputable intermediates (e.g., temporary buffers, local scratch arrays).
- Derived quantities that can be recomputed at restart.
- Logging/debug information (can be separate from restart state).
A useful practice is to document in code what is considered checkpoint state and maintain this list as the code evolves.
For distributed-memory codes:
- Each rank typically stores only its local portion of the global data (e.g., domain decomposition).
- Metadata must describe:
- Decomposition layout.
- Mapping from ranks to data regions.
- Any ghost cell or halo structure.
Where to store checkpoints
Local vs. shared storage
Local storage (node-local disks, SSDs, NVMe, RAM disks)
- Pros:
- Lower latency, higher bandwidth.
- Reduces load on the shared parallel filesystem.
- Cons:
- Not persistent if the node fails or is reallocated.
- Harder to use for long-term restart or if jobs migrate.
Shared/parallel filesystem (e.g., Lustre, GPFS)
- Pros:
- Accessible from all nodes; survives individual node failures.
- Suitable for long-running jobs and inter-job restarts.
- Cons:
- Higher contention; I/O storms when many jobs checkpoint simultaneously.
- Higher latency; more sensitive to poor I/O patterns.
A common hybrid strategy:
- Checkpoint frequently to local storage.
- Periodically copy or stage “major” checkpoints to the parallel filesystem.
- Use this as a two-level scheme: fast local safety + durable global backup.
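A sketch of the staging step, assuming node-local scratch under /tmp and a project directory on the parallel filesystem (both paths, the `write_fn` callback, and the staging period are placeholders):

```python
import shutil
from pathlib import Path

LOCAL_DIR = Path("/tmp/myjob_ckpt")        # node-local SSD / RAM disk (placeholder)
GLOBAL_DIR = Path("/lustre/project/ckpt")  # parallel filesystem (placeholder)

def checkpoint_two_level(step, write_fn, stage_every=5):
    """Write every checkpoint locally; copy every `stage_every`-th one to
    the parallel filesystem as the durable "major" checkpoint."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    local_path = LOCAL_DIR / f"chkpt_step_{step:06d}.h5"
    write_fn(local_path)  # fast local write, supplied by the application
    if step % stage_every == 0:
        GLOBAL_DIR.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_path, GLOBAL_DIR / local_path.name)  # durable copy
```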
Redundancy and replication
To protect against storage failures:
- Store checkpoints in multiple locations, e.g., two directories on different filesystems.
- For critical runs, keep older checkpoint generations, not just the latest:
- Protects against corruption or bugs introduced in new checkpoint formats.
- Allows “rolling back” to a known good state.
Trade-offs:
- Increased storage consumption.
- More I/O time when duplicating checkpoints.
Policy examples:
- Keep the last N checkpoints (e.g., N=3) and delete older ones.
- For each “major” checkpoint, create a second copy on a different filesystem or tape/archive.
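A sketch of the keep-last-N policy, assuming zero-padded step numbers in the file names so that lexicographic order matches chronological order (the pattern and N are illustrative):

```python
import glob
import os

def prune_checkpoints(pattern="chkpt_step_*.h5", keep=3):
    """Delete all but the `keep` most recent checkpoints matching `pattern`."""
    files = sorted(glob.glob(pattern))
    for old in files[:-keep]:
        os.remove(old)
```

Pruning should run only after the newest checkpoint has been written completely and validated, so a failed write never leaves the run with no usable checkpoint.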
Structuring checkpoint files
Per-rank vs. collective checkpoint files
Per-rank (one file per MPI task)
- Each rank writes its own checkpoint file, e.g., chkpt_rank0001.h5.
- Pros:
- Simple to implement; no inter-rank I/O coordination.
- Natural mapping between rank and file.
- Cons:
- Large number of files for big jobs (scalability, metadata overhead).
- Harder to manage and move.
Collective (few or one global file)
- Ranks use parallel I/O (e.g., MPI-IO, HDF5, NetCDF) to write into a shared file or a small set of files.
- Pros:
- Fewer files; better suited to large-scale jobs.
- Often better performance on parallel filesystems when implemented properly.
- Cons:
- More complex code.
- Failure during I/O can corrupt a large shared file.
A common compromise:
- A small number of files, e.g., one per node or one per MPI communicator group.
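A sketch of a single shared checkpoint file written collectively, assuming h5py built against parallel HDF5 (mpi4py plus the `mpio` driver) and a simple 1D block decomposition; the dataset name and sizes are illustrative:

```python
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

local_n = 1000                               # elements owned by this rank
local_u = np.full(local_n, rank, dtype="f8")
offset = rank * local_n                      # 1D block decomposition

# Every rank opens the same file; h5py/HDF5 use MPI-IO underneath.
with h5py.File("chkpt_global.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("u", shape=(nprocs * local_n,), dtype="f8")
    dset[offset:offset + local_n] = local_u  # each rank writes its own slab
```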
Versioning and metadata
Include metadata in or alongside checkpoints:
- Checkpoint version number (format version).
- Code git hash or build identifier.
- Compiler and library versions if relevant.
- Date/time and hostname.
- Description of domain decomposition and layout.
Benefits:
- Helps detect incompatibilities when restarting with a newer version of the code.
- Makes debugging restart issues far easier.
Naming schemes:
- chkpt_step_000100_rank_0000.h5
- chkpt_t_0010.0_global.h5
- Include time step or physical time, and possibly job ID.
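A sketch of attaching such metadata, assuming HDF5 attributes on the checkpoint file and a helper that builds names in the style above (the attribute names and the `git_hash` argument are illustrative; in practice the hash is usually baked in at build time):

```python
import datetime
import socket
import h5py

def add_checkpoint_metadata(path, fmt_version, git_hash, step, phys_time):
    """Attach restart metadata to an existing checkpoint file."""
    with h5py.File(path, "a") as f:
        f.attrs["format_version"] = fmt_version
        f.attrs["git_hash"] = git_hash
        f.attrs["written_at"] = datetime.datetime.now().isoformat()
        f.attrs["hostname"] = socket.gethostname()
        f.attrs["step"] = step
        f.attrs["physical_time"] = phys_time

def checkpoint_name(step, rank):
    # Zero-padded step and rank keep files sortable, as in the schemes above.
    return f"chkpt_step_{step:06d}_rank_{rank:04d}.h5"
```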
Consistency and correctness
Achieving consistent global state
For parallel codes, a checkpoint should represent a state that could have arisen in a normal execution without checkpointing.
Key points:
- Write at synchronization points:
- After a global communication (e.g., collective operation).
- At the end of a time step where all processes have:
- Updated their local data.
- Exchanged halo or ghost cells.
- Avoid checkpointing:
- Mid-way through long communication phases.
- When buffers contain messages that have not yet been processed.
Simple practice:
- Place checkpoint calls at the end of the main iteration loop, after communications and updates.
- Use an MPI barrier or equivalent if needed to ensure all ranks reach the checkpoint stage together.
Restart validation
To ensure the strategy is correct:
- Restart tests:
- Run a simulation for some time, checkpoint, stop.
- Restart from that checkpoint and continue.
- Compare results against an uninterrupted reference run (bitwise or within acceptable numerical tolerance).
- Partial restart:
- Test restarting with a different number of processes if your design is supposed to support this.
- Verify that domain redistribution and mapping are correctly handled.
Debugging tips:
- Check that all critical state (time, iteration, seeds, parameters) is restored, not only the main fields.
- Ensure that I/O buffering is flushed and files are closed before assuming a checkpoint is valid.
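A sketch of such a restart test, assuming a stand-in `advance` function for the real time loop and comparison within a numerical tolerance (the update rule, step counts, and tolerances are illustrative):

```python
import numpy as np

def advance(state, nsteps):
    """Stand-in for the real time-stepping loop."""
    for _ in range(nsteps):
        state = 0.5 * state + 1.0  # illustrative update
    return state

def test_restart_matches_reference():
    state0 = np.linspace(0.0, 1.0, 1000)

    # Reference: 200 steps without interruption.
    reference = advance(state0.copy(), 200)

    # Interrupted run: 120 steps, checkpoint, restart, 80 more steps.
    mid = advance(state0.copy(), 120)
    np.save("chkpt_test.npy", mid)        # checkpoint
    restored = np.load("chkpt_test.npy")  # restart
    resumed = advance(restored, 80)

    # Bitwise equality can be too strict if reductions are reordered;
    # compare within an acceptable numerical tolerance instead.
    assert np.allclose(resumed, reference, rtol=1e-12, atol=0.0)
```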
Performance and scalability considerations
Reducing I/O overhead
- Compress checkpoint data (if CPU is cheaper than I/O time):
- Lossless compression (e.g., gzip, zstd, HDF5 filters).
- In some scientific contexts, lossy compression may be acceptable (e.g., floating-point compressors) but must be carefully evaluated.
- Minimize data volume:
- Remove redundant or recomputable data from checkpoints.
- Use more compact data types where appropriate.
- Align with filesystem characteristics:
- Use large, contiguous writes.
- Avoid many tiny writes and random access patterns.
- Batch metadata updates; reduce number of checkpoint files where possible.
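As a small sketch of lossless compression through an HDF5 filter, assuming h5py (the chunk shape and gzip level are illustrative and should be tuned against measured I/O time):

```python
import h5py
import numpy as np

u = np.random.default_rng(0).random((2048, 2048))

with h5py.File("chkpt_compressed.h5", "w") as f:
    f.create_dataset("u", data=u,
                     chunks=(256, 2048),   # large, contiguous chunks
                     compression="gzip",   # lossless; "lzf" is a faster, lighter option
                     compression_opts=4)   # gzip level: trades CPU for size
```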
Staggered checkpointing
To avoid I/O storms when many jobs checkpoint simultaneously:
- Randomize or stagger checkpoint times across jobs:
- Add a small random offset to the checkpoint schedule.
- Within a single application:
- If using multiple checkpoints (e.g., multi-level strategies), offset them across process groups.
Some centers provide guidance on recommended checkpoint intervals and times based on overall system load.
Multi-level checkpointing
Multi-level strategies use several “tiers” of reliability and cost:
- Level 1: Very frequent, cheap, local checkpoints (e.g., in-node memory-copy or SSD).
- Protects against soft errors or application crashes that don’t involve node failure.
- Level 2: Less frequent checkpoints to shared parallel filesystem.
- Protects against node failures.
- Level 3: Rare checkpoints to archival or campaign storage (e.g., tape or an archive filesystem).
- Protects long-running campaigns or highly valuable simulations.
This hierarchical approach aims to match checkpoint cost to expected failure modes and frequency.
Integration with job scheduling and workflow
Checkpointing and wall-time limits
In batch systems with strict wall-time limits:
- Plan checkpoint intervals so that:
- You can safely write a checkpoint before wall time expires.
- You do not lose too much computation if the job ends early (e.g., due to preemption).
- Common strategy:
- Estimate time needed for a checkpoint (possibly worst-case).
- Ensure that your final checkpoint is initiated well before job end (e.g., 2–3 checkpoint durations before wall-time limit).
Some schedulers may:
- Provide signals (e.g., SIGTERM) some minutes before killing the job.
- Allow jobs to catch these signals and trigger a last-minute checkpoint.
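A sketch of catching such a signal, assuming Python's standard signal module; the handler only sets a flag, and the actual checkpoint is written in the main loop where the state is known to be consistent (the loop shown in comments uses hypothetical helper names):

```python
import signal

checkpoint_requested = False

def _handle_term(signum, frame):
    """Only set a flag here; do the I/O in the main loop, where the
    application state is consistent."""
    global checkpoint_requested
    checkpoint_requested = True

# Many batch systems send SIGTERM (or a configurable signal such as
# SIGUSR1) some minutes before killing the job.
signal.signal(signal.SIGTERM, _handle_term)
signal.signal(signal.SIGUSR1, _handle_term)

# Illustrative main loop (advance_one_step and write_checkpoint are
# hypothetical application functions):
# for step in range(1, nsteps + 1):
#     advance_one_step()
#     if checkpoint_requested or step % interval == 0:
#         write_checkpoint(step)
#         if checkpoint_requested:
#             break  # exit cleanly before the hard kill
```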
Checkpointing in multi-stage workflows
In workflows involving multiple jobs or stages:
- Use checkpointing to:
- Bridge between stages (e.g., restart from a saved state with different resolution or parameters).
- Facilitate branching (e.g., restart the same state with different physics options).
- Document:
- Which checkpoint belongs to which stage.
- How to move from checkpoint A to stage B.
Workflow managers and scripts can:
- Automatically manage checkpoint naming.
- Copy or archive older checkpoints.
- Automate validation of successful restarts.
Practical guidelines and best practices
- Start with a simple coordinated, full application-level checkpoint at clear iteration boundaries.
- Choose an initial checkpoint interval based on:
- Job length (target multiple checkpoints over the full run).
- Reasonable overhead (e.g., checkpointing overhead < 5–10% of runtime).
- Record checkpoint metadata (version, parameters, build info).
- Design for incremental evolution:
- Plan how checkpoint formats can evolve while remaining backward compatible or at least detect incompatibility cleanly.
- Regularly test restart capability as part of your development and QA process.
- Monitor:
- Checkpoint sizes.
- Checkpoint duration.
- Impact on overall I/O performance (both for your job and at system scale).
- Coordinate with the HPC support team:
- Learn filesystem-specific recommendations.
- Understand any site policies about checkpoint intervals and data retention.
A thoughtful checkpointing strategy turns failures and interruptions from catastrophic losses into manageable inconveniences, and is essential for reliable large-scale HPC runs.