
Checkpointing strategies

Goals of checkpointing

Checkpointing is about periodically saving enough state of a running application so that, after a failure or interruption, you can restart from a recent point instead of from the very beginning. A checkpointing strategy is about deciding what to save, when to save it, and where to store it.

A good checkpointing strategy balances the time and I/O spent writing checkpoints during failure-free execution against the amount of work that is lost and must be recomputed when a failure occurs.
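To make the idea concrete, here is a minimal checkpoint/restart loop in Python. The file name, the checkpoint interval, and the toy update are all illustrative; the pattern is simply: try to load a checkpoint at startup, and save one atomically every so often.

```python
import os
import pickle

CKPT_FILE = "checkpoint.pkl"   # illustrative file name
CKPT_EVERY = 100               # illustrative interval (iterations)

def save_checkpoint(state, path=CKPT_FILE):
    """Write the state atomically: write to a temp file, then rename."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)      # atomic rename, so the file is never half-written

def load_checkpoint(path=CKPT_FILE):
    """Return the saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Restart from the last checkpoint if one exists, otherwise start fresh.
state = load_checkpoint() or {"step": 0, "x": 0.0}

while state["step"] < 1000:
    state["x"] = 0.5 * state["x"] + 1.0    # placeholder for real work
    state["step"] += 1
    if state["step"] % CKPT_EVERY == 0:
        save_checkpoint(state)
```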

Types of checkpointing

Application-level vs. system-level

Application-level checkpointing

System-level (transparent) checkpointing

In HPC, application-level checkpointing is common for large-scale numerical codes; system-level is often used in research settings or as a safety net.

Coordinated vs. uncoordinated (distributed-memory codes)

For MPI and other distributed-memory applications, each process must produce checkpoints that together represent a consistent global state (no messages “in flight” that might be lost).

Coordinated checkpointing

Uncoordinated checkpointing

Practical takeaway: most production HPC codes use coordinated, application-level checkpoints taken at well-defined iteration or time-step boundaries; a minimal example of this pattern is sketched below.
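The sketch below assumes mpi4py and NumPy and writes one file per rank; the file naming and the barriers are illustrative. The important point is that every rank checkpoints at the same iteration boundary, when no application messages are outstanding.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def coordinated_checkpoint(step, local_state, prefix="ckpt"):
    """All ranks reach the same step boundary, then write their own file."""
    comm.Barrier()                       # agree on the checkpoint point
    fname = f"{prefix}_step{step:06d}_rank{rank:05d}.npy"
    np.save(fname, local_state)          # per-rank file (one file per task)
    comm.Barrier()                       # everyone has finished writing

# Toy time-stepping loop with a checkpoint every 50 steps.
local_state = np.zeros(1000) + rank
for step in range(1, 201):
    local_state += 1.0                   # placeholder for real computation
    if step % 50 == 0:
        coordinated_checkpoint(step, local_state)
```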

Full vs. incremental vs. differential

Full checkpoints

Incremental checkpoints

Differential checkpoints

For many codes, the sensible first step is simple full checkpoints; incremental checkpoints are worth considering only if checkpoint I/O becomes a bottleneck. A toy incremental scheme is sketched below.
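The following sketch illustrates one possible incremental scheme: the state array is split into fixed-size blocks, and only blocks whose hash has changed since the previous checkpoint are written. The block size, hashing, and file format are all illustrative; note that restoring requires the last full checkpoint plus every increment written since.

```python
import hashlib
import pickle
import numpy as np

class IncrementalCheckpointer:
    """Write only the blocks of a large array that changed since the last call."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.last_hashes = {}            # block offset -> hash of last written block

    def checkpoint(self, array, path):
        data = array.tobytes()
        changed = {}
        for offset in range(0, len(data), self.block_size):
            chunk = data[offset:offset + self.block_size]
            h = hashlib.sha256(chunk).hexdigest()
            if self.last_hashes.get(offset) != h:
                changed[offset] = chunk  # new or modified block
                self.last_hashes[offset] = h
        with open(path, "wb") as f:
            pickle.dump({"dtype": str(array.dtype),
                         "shape": array.shape,
                         "blocks": changed}, f)
        return len(changed)

ckpt = IncrementalCheckpointer()
x = np.zeros(100_000)
ckpt.checkpoint(x, "ckpt_000_full.pkl")       # first call writes every block
x[:10] += 1.0                                  # small local change
n = ckpt.checkpoint(x, "ckpt_001_incr.pkl")    # only the changed block is written
print(f"{n} block(s) written in the incremental checkpoint")
```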

Synchronous vs. asynchronous checkpointing

Synchronous checkpointing

Asynchronous checkpointing
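In contrast to synchronous checkpointing, where the computation pauses until the checkpoint is fully written, asynchronous checkpointing overlaps the slow part (writing the data) with continued computation, typically by taking a consistent snapshot of the state and handing it to a background thread or a dedicated I/O process. A minimal sketch using a Python background thread follows; the file names are illustrative, and it assumes that copying the state is cheap compared to writing it.

```python
import copy
import pickle
import threading

def async_checkpoint(state, path):
    """Snapshot the state synchronously, then write it in a background thread."""
    snapshot = copy.deepcopy(state)      # consistent copy taken before continuing

    def writer():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    t = threading.Thread(target=writer)
    t.start()
    return t                             # caller joins before the next checkpoint

state = {"step": 0, "x": [0.0] * 1000}
pending = None
for step in range(1, 501):
    state["x"] = [v + 1.0 for v in state["x"]]   # placeholder work
    state["step"] = step
    if step % 100 == 0:
        if pending is not None:
            pending.join()               # never overlap two checkpoint writes
        pending = async_checkpoint(state, f"ckpt_{step:06d}.pkl")
if pending is not None:
    pending.join()
```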

When to checkpoint: frequency and policies

Fixed-interval strategies

Most beginners start with a simple rule such as checkpointing every fixed number of iterations or every fixed amount of wall-clock time (for example, once per hour).

This is easy to configure and reason about, but may not be optimal when the checkpoint cost or the failure rate changes over the course of a run, or when different phases of the application carry very different amounts of state.

Analytical trade-off: checkpoint interval vs. failure rate

There is a well-known trade-off between checkpointing too often (high overhead even when nothing fails) and checkpointing too rarely (large amounts of recomputation after a failure).

A classical approximation for the optimal time between checkpoints $T_{\text{opt}}$ is based on two quantities: the time $C$ needed to write one checkpoint, and the mean time between failures (MTBF) of the system at the scale of the job.

A commonly used formula (Young/Daly approximation) is:

$$
T_{\text{opt}} \approx \sqrt{2 \, C \, MTBF}
$$

Interpretation: the more expensive a checkpoint is (larger $C$), the longer you should wait between checkpoints; the less reliable the system is (smaller MTBF), the more often you should checkpoint.

In practice, neither $C$ nor the MTBF is known precisely, so the formula is best treated as an order-of-magnitude guide; a quick calculation like the one below is usually enough to choose a sensible interval.
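A small worked example, with purely illustrative numbers:

```python
from math import sqrt

def young_daly_interval(checkpoint_cost_s, mtbf_s):
    """Young/Daly first-order estimate of the optimal time between checkpoints."""
    return sqrt(2.0 * checkpoint_cost_s * mtbf_s)

C = 5 * 60          # writing a checkpoint takes 5 minutes (illustrative)
MTBF = 24 * 3600    # one failure per day at this job size (illustrative)

t_opt = young_daly_interval(C, MTBF)
print(f"checkpoint roughly every {t_opt / 3600:.1f} hours")   # about every 2 hours
```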

Adaptive strategies

More advanced strategies adapt the checkpoint frequency based on observed failure rates, the measured cost of writing a checkpoint, and the remaining wall time of the job.

Examples:

Phase-aware checkpointing

Some applications have distinct phases:

Strategies:

What to checkpoint: selecting state

The key is to checkpoint only what is necessary and sufficient to reconstruct the computation.

Common components are the primary solution arrays or fields, iteration and time-step counters, the current simulation time, random-number-generator state, and any accumulators needed to continue convergence checks or statistics.

Often not needed (or better reconstructed on restart) are derived and diagnostic quantities, halo or ghost-cell data, cached matrices, and static input data such as meshes and configuration files that can simply be re-read.

A useful practice is to document in code what is considered checkpoint state and maintain this list as the code evolves.
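As an illustration, the sketch below gathers a hypothetical solver's restart state into one structure, including the random-number-generator state, and round-trips it through a file. The field names and the use of pickle are illustrative; large arrays are usually written with HDF5 or a similar format.

```python
import pickle
import numpy as np

def gather_checkpoint_state(solution, step, sim_time, rng):
    """Collect only what is necessary and sufficient to resume the run."""
    return {
        "format_version": 1,                    # lets future code reject old layouts
        "solution": solution,                   # primary unknowns
        "step": step,                           # iteration counter
        "sim_time": sim_time,                   # physical time
        "rng_state": rng.bit_generator.state,   # reproducible random streams
    }

def restore(path):
    with open(path, "rb") as f:
        state = pickle.load(f)
    rng = np.random.default_rng()
    rng.bit_generator.state = state["rng_state"]
    return state["solution"], state["step"], state["sim_time"], rng

# Round trip: save, restore, and check the solution is unchanged.
rng = np.random.default_rng(seed=42)
u = rng.random(1000)
with open("ckpt.pkl", "wb") as f:
    pickle.dump(gather_checkpoint_state(u, step=100, sim_time=0.25, rng=rng), f)
u2, step, sim_time, rng2 = restore("ckpt.pkl")
assert np.array_equal(u, u2) and step == 100
```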

For distributed-memory codes, each rank must additionally record enough information about the domain decomposition and its own identity that the global state can be reassembled (or re-partitioned) on restart.

Where to store checkpoints

Local vs. shared storage

Local storage (node-local disks, SSDs, NVMe, RAM disks)

Shared/parallel filesystem (e.g., Lustre, GPFS)

A common hybrid strategy is to write checkpoints to fast node-local storage first and then copy (drain) them to the shared parallel filesystem in the background or at a lower frequency; see the sketch under Multi-level checkpointing below.

Redundancy and replication

To protect against storage failures, keep more than one checkpoint generation and, where practical, replicate the most recent checkpoint to a second location (for example, a partner node or the shared filesystem).

Trade-offs:

Policy examples:

Structuring checkpoint files

Per-rank vs. collective checkpoint files

Per-rank (one file per MPI task)

Collective (few or one global file)

A common compromise is to aggregate ranks into a moderate number of shared files (for example, one file per node or per group of ranks), balancing metadata pressure from many small files against contention on a single large one.

Versioning and metadata

Include metadata in or alongside checkpoints: the iteration or time-step index, the simulation time, the code version, the creation timestamp, and a format version number.

Benefits:

Naming schemes should encode the step or time index (zero-padded so names sort naturally) and, for per-rank files, the rank, so that successive checkpoints never overwrite each other; one possible convention is sketched below.
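A minimal sketch of such a convention, with an illustrative directory layout and a JSON metadata sidecar written via an atomic rename:

```python
import json
import os
import time

def write_metadata(ckpt_path, step, sim_time, code_version="v1.2.3"):
    """Write a small JSON sidecar describing the checkpoint (values illustrative)."""
    meta = {
        "checkpoint_file": os.path.basename(ckpt_path),
        "step": step,
        "sim_time": sim_time,
        "written_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "code_version": code_version,    # e.g. a git commit hash or release tag
    }
    tmp = ckpt_path + ".meta.tmp"
    with open(tmp, "w") as f:
        json.dump(meta, f, indent=2)
    os.replace(tmp, ckpt_path + ".meta.json")   # atomic rename

# Step index zero-padded so directory listings sort in checkpoint order.
step, sim_time = 12000, 3.75
os.makedirs("run01", exist_ok=True)
ckpt_path = f"run01/ckpt_step{step:08d}.dat"    # stand-in for the real data file
open(ckpt_path, "wb").close()
write_metadata(ckpt_path, step, sim_time)
```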

Consistency and correctness

Achieving consistent global state

For parallel codes, a checkpoint should represent a state that could have arisen in a normal execution without checkpointing.

Key points:

A simple practice is to checkpoint at a global synchronization point, such as the end of a time step once all communication for that step has completed.

Restart validation

To ensure the strategy is correct, regularly exercise the restart path, for example by comparing a run that is checkpointed and restarted against an uninterrupted reference run; a minimal test of this kind is sketched below.
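The toy test below uses a stand-in deterministic update in place of a real solver; for a deterministic code the restarted run should match the reference bitwise, otherwise compare within a tolerance appropriate to the algorithm.

```python
import numpy as np

def advance(u, nsteps):
    """Toy deterministic update standing in for the real solver."""
    for _ in range(nsteps):
        u = 0.5 * (u + np.roll(u, 1))
    return u

u0 = np.linspace(0.0, 1.0, 64)

ref = advance(u0.copy(), 200)        # reference: 200 uninterrupted steps

mid = advance(u0.copy(), 120)        # run 120 steps...
saved = mid.copy()                   # ...stand-in for writing and reading a checkpoint
restarted = advance(saved, 80)       # ...then restart and run the remaining 80

assert np.array_equal(ref, restarted)
print("restart reproduces the uninterrupted run")
```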

Debugging tip: when restarted runs diverge, compare the checkpointed fields value by value and verify that every piece of mutable state (including counters and random-number streams) is actually being saved and restored.

Performance and scalability considerations

Reducing I/O overhead

Staggered checkpointing

To avoid I/O storms when many jobs checkpoint simultaneously, add a randomized offset (jitter) to checkpoint times so that jobs with the same nominal interval do not all hit the shared filesystem at once; a small helper is sketched below.

Some centers provide guidance on recommended checkpoint intervals and times based on overall system load.
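A minimal jittered-interval helper (the base interval and jitter fraction are illustrative):

```python
import random

def next_checkpoint_time(now_s, base_interval_s, jitter_fraction=0.1):
    """Return the next checkpoint time, spread out by random jitter."""
    jitter = random.uniform(-jitter_fraction, jitter_fraction) * base_interval_s
    return now_s + base_interval_s + jitter

t = 0.0
for _ in range(5):
    t = next_checkpoint_time(t, base_interval_s=3600)   # roughly hourly, +/- 6 minutes
    print(f"checkpoint at t = {t / 3600:.2f} h")
```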

Multi-level checkpointing

Multi-level strategies use several “tiers” of reliability and cost: frequent, cheap checkpoints to node-local memory or SSD; less frequent copies to partner nodes or burst buffers; and occasional checkpoints to the parallel filesystem or archival storage.

This hierarchical approach aims to match checkpoint cost to expected failure modes and frequency.
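A two-tier sketch of this idea, with illustrative directories and intervals: frequent, cheap checkpoints go to a stand-in for node-local storage, and every tenth one is promoted to a stand-in for the shared filesystem.

```python
import os
import pickle
import shutil

LOCAL_DIR = "/tmp/ckpt_local"   # stand-in for node-local SSD or RAM disk
SHARED_DIR = "shared_ckpt"      # stand-in for the parallel filesystem
LOCAL_EVERY = 10                # cheap and frequent: protects against job crashes
SHARED_EVERY = 100              # expensive and rare: survives losing the node

os.makedirs(LOCAL_DIR, exist_ok=True)
os.makedirs(SHARED_DIR, exist_ok=True)

def write_local(state, step):
    path = os.path.join(LOCAL_DIR, f"ckpt_{step:06d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path

state = {"step": 0, "x": 0.0}
for step in range(1, 301):
    state["x"] += 1.0            # placeholder work
    state["step"] = step
    if step % LOCAL_EVERY == 0:
        local_path = write_local(state, step)
        if step % SHARED_EVERY == 0:
            shutil.copy(local_path, SHARED_DIR)   # promote to the durable tier
```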

Integration with job scheduling and workflow

Checkpointing and wall-time limits

In batch systems with strict wall-time limits, the application should write a final checkpoint comfortably before the limit is reached so that the run can be resubmitted and continued.

Some schedulers can deliver a warning signal a configurable number of minutes before the job is killed; the application can catch this signal and trigger a last checkpoint, as sketched below.
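A sketch of the signal-handling pattern in Python; the choice of SIGUSR1 and the file name are illustrative. The handler only sets a flag, and the actual checkpoint is written at the next safe point in the main loop.

```python
import pickle
import signal

checkpoint_requested = False

def request_checkpoint(signum, frame):
    """Signal handler: just set a flag; do the real work at a safe point."""
    global checkpoint_requested
    checkpoint_requested = True

# Many batch systems can deliver a signal shortly before the wall-time limit.
signal.signal(signal.SIGUSR1, request_checkpoint)
signal.signal(signal.SIGTERM, request_checkpoint)

state = {"step": 0, "x": 0.0}
for step in range(1, 100001):
    state["x"] += 1.0                    # placeholder work
    state["step"] = step
    if checkpoint_requested:
        with open("ckpt_on_signal.pkl", "wb") as f:
            pickle.dump(state, f)
        print(f"checkpointed at step {step}, exiting cleanly")
        break
```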

Checkpointing in multi-stage workflows

In workflows involving multiple jobs or stages, each stage needs to know where the previous stage left its checkpoints and be able to start from them.

Workflow managers and scripts can detect the most recent valid checkpoint at job start and resume from it automatically, which makes resubmission safe and repeatable; a sketch of such a resume helper follows.
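A minimal resume helper, assuming checkpoint files named like ckpt_step000100.pkl; the pattern and the state layout are illustrative.

```python
import os
import pickle
import re

def latest_checkpoint(directory=".", pattern=r"ckpt_step(\d+)\.pkl"):
    """Return (step, path) of the highest-numbered checkpoint, or (-1, None)."""
    best_step, best_path = -1, None
    for name in os.listdir(directory):
        m = re.fullmatch(pattern, name)
        if m and int(m.group(1)) > best_step:
            best_step = int(m.group(1))
            best_path = os.path.join(directory, name)
    return best_step, best_path

# At job start: resume if a checkpoint exists, otherwise begin a fresh run.
step, path = latest_checkpoint()
if path is not None:
    with open(path, "rb") as f:
        state = pickle.load(f)
    print(f"resuming from step {step}")
else:
    state = {"step": 0, "x": 0.0}
    print("no checkpoint found, starting fresh")
```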

Practical guidelines and best practices

A thoughtful checkpointing strategy turns failures and interruptions from catastrophic losses into manageable inconveniences, and is essential for reliable large-scale HPC runs.
