Introduction
Checkpointing is the practice of saving a program’s execution state to persistent storage so that it can later continue from that point instead of starting over. In HPC workflows, checkpointing is a central technique to tolerate failures, manage long runs within time limits, and support restarts and migration. This chapter focuses on concrete strategies for checkpointing in parallel and large scale environments, and how to choose and design them for real applications.
Goals and trade‑offs of checkpointing
Checkpointing always balances three main costs: the time spent writing checkpoints, the storage space consumed, and the amount of lost work after a failure. Writing very frequent, large checkpoints reduces lost work but can make the application I/O bound and fill the filesystem. Infrequent or tiny checkpoints reduce overhead but risk losing hours of computation.
A good checkpoint strategy minimizes total time to solution, which is the sum of:
- Pure compute time.
- Time spent writing checkpoints.
- Time lost redoing work after failures.
The choice of strategy depends on job length, failure rates on the system, queue wait times, filesystem performance, and whether the application is tightly coupled and parallel, or mostly independent tasks.
Basic checkpointing patterns
At the simplest level, checkpointing can be automatic at the system level or manual inside the application code.
System level checkpointing tries to snapshot processes without application awareness. It can be attractive for legacy codes but often struggles with scalability and special resources such as GPUs or complex MPI states.
Application level checkpointing is done inside the code. The program carefully selects which data structures define the state, writes them to disk in a format it understands, and knows how to restore them. This approach has higher development cost but usually offers better performance and robustness at scale.
Within application level checkpointing, three basic patterns are common: full, incremental, and differential checkpointing.
Full checkpointing
Full checkpointing writes all the state required to restart the computation from a specific point. For a simple simulation this often includes the entire grid or particle data, time step counters, and parameters.
The main advantages are simplicity and reliability. Implementation is straightforward because the restart code reads one consistent image of the state. Debugging is easier because each checkpoint is self contained.
The cost is high I/O volume. For large problems, each full checkpoint can be many gigabytes or even terabytes, and writing such data may require seconds to minutes on shared filesystems. If full checkpoints are taken frequently, the I/O overhead can dominate runtime.
Full checkpointing is suitable when simulations are modest in size, storage is plentiful, or when checkpoints are infrequent, for example only at major milestones.
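As a minimal sketch of this pattern, the fragment below writes and reads a single self-contained checkpoint file. The use of Python with pickle, and names such as `write_full_checkpoint`, are illustrative choices rather than a prescribed interface.

```python
import pickle

def write_full_checkpoint(path, step, sim_time, grid, params):
    # Everything needed to restart goes into one self-contained file.
    state = {"step": step, "sim_time": sim_time, "grid": grid, "params": params}
    with open(path, "wb") as f:
        pickle.dump(state, f)

def read_full_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage sketch inside the main loop, e.g. every 100 steps:
# if step % 100 == 0:
#     write_full_checkpoint(f"ckpt_{step:06d}.pkl", step, sim_time, grid, params)
```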
Incremental and differential checkpointing
Incremental and differential approaches try to reduce I/O volume by writing only parts of the state.
In incremental checkpointing the application records changes since the last checkpoint. At checkpoint $n$, it writes only the data that has changed since checkpoint $n - 1$. To reconstruct the latest state, the restart procedure must apply a sequence of changes starting from a base checkpoint.
In differential checkpointing the application records changes relative to a fixed reference, often the previous full checkpoint. Every differential checkpoint is taken with respect to that base. Reconstruction requires the base image plus the latest differential information.
These methods can significantly reduce checkpoint size when the state changes slowly in space or time, or when only a small fraction of data is updated. For example, adaptive mesh refinement or sparse data structures may benefit.
The trade‑off is complexity. Restarting may require reading multiple files and combining them correctly. There is also an increased risk that corruption in an older file invalidates a long chain of dependent checkpoints. For this reason, many applications combine occasional full checkpoints with frequent incremental ones, resetting the chain from time to time.
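One possible way to detect and store only changed data is to hash fixed-size blocks of the state and write only the blocks whose hashes differ from the previous checkpoint. The block layout, hashing scheme, and file format in the sketch below are illustrative assumptions.

```python
import hashlib
import numpy as np

def block_hashes(grid, block):
    # Hash fixed-size row blocks to detect which parts of the state changed.
    return [hashlib.sha1(grid[i:i + block].tobytes()).hexdigest()
            for i in range(0, grid.shape[0], block)]

def write_incremental(path, grid, block, prev_hashes):
    hashes = block_hashes(grid, block)
    changed = {i: grid[i * block:(i + 1) * block]
               for i, (h, p) in enumerate(zip(hashes, prev_hashes)) if h != p}
    np.savez(path, indices=np.array(sorted(changed)),
             **{f"blk_{i}": b for i, b in changed.items()})
    return hashes  # becomes prev_hashes for the next incremental checkpoint

# Seed prev_hashes right after writing a full checkpoint:
#   prev_hashes = block_hashes(grid, block)
# Restart: load the last full checkpoint, then replay each incremental file in
# order, overwriting the recorded blocks. Periodic full checkpoints reset the
# chain so that one corrupt file cannot invalidate a long history.
```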
Coordinated checkpointing in parallel programs
Parallel applications, especially MPI codes, must ensure that all processes agree on a consistent global state at each checkpoint. Inconsistent snapshots can lead to impossible states on restart, such as messages in transit without a sender or receiver in the stored state.
Coordinated checkpointing requires all processes in the parallel job to reach a checkpointing phase at the same global point in the computation. Typically processes synchronize at a well defined iteration boundary, then each writes its part of the state. The logical invariant is that the set of checkpoints, across all ranks, represents a single valid state of the whole system.
Because of this global coordination, failures are simpler to recover from. The application restarts with the same number of processes, each loading its own file or a portion of shared files, and then continues. Many large scale production codes use coordinated checkpointing, often coupled with parallel I/O libraries and file formats that can handle many writers efficiently.
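The following sketch illustrates the coordination pattern with mpi4py and a file-per-rank layout. The file naming and the use of two barriers are illustrative assumptions; production codes would typically rely on a parallel I/O library instead of one file per rank.

```python
from mpi4py import MPI
import numpy as np

def coordinated_checkpoint(comm, step, local_array, prefix="ckpt"):
    # All ranks reach the same iteration boundary before anyone writes, so the
    # set of per-rank files represents one consistent global state.
    comm.Barrier()
    rank = comm.Get_rank()
    np.savez(f"{prefix}_step{step:06d}_rank{rank:05d}.npz",
             step=step, local_array=local_array)
    # A second barrier (or a reduction over per-rank write status) confirms the
    # checkpoint set is complete before computation resumes.
    comm.Barrier()

# Usage inside the time-stepping loop:
# if step % checkpoint_interval == 0:
#     coordinated_checkpoint(MPI.COMM_WORLD, step, local_array)
```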
The main drawback is the global pause at checkpoint time. Fast processes wait for slower ones to reach the checkpoint, and then all perform I/O simultaneously, which can stress the filesystem. At very large scale, this can limit how frequently checkpoints can be taken.
Uncoordinated and message logging strategies
Uncoordinated checkpointing allows processes to checkpoint independently without global synchronization. Each process takes checkpoints at its own schedule. To ensure correctness, this is often combined with message logging, where information about sent messages is recorded so that, upon restart, the communication history can be reconstructed.
In a pure uncoordinated strategy without any logging, restarts may lead to inconsistent states and even to a domino effect, where the failure of a single process forces the rollback of many others to very old checkpoints to reestablish a consistent global state.
With message logging, each process records metadata or payloads for outgoing messages. Several flavors exist, such as pessimistic logging, which ensures logs are always stored safely before messages are considered delivered, and optimistic logging, which trades some safety for performance. Upon failure, only the failed process is rolled back to its last checkpoint, and message logs are used to replay interactions with surviving processes so that the global state becomes consistent again.
These techniques are sophisticated and mostly appear in specialized research systems or advanced fault tolerant runtimes. They can reduce the cost of failures, since not all processes need to roll back, but they come with extra runtime overhead and more complex implementation.
Checkpoint frequency and optimal intervals
Choosing how often to checkpoint is crucial. Intuitively, if failures are rare, writing many checkpoints wastes time. If failures are frequent, writing few checkpoints risks losing large amounts of progress.
A classical model considers a mean time between failures, often written as $MTBF$, and a constant time to write a checkpoint, $C$. Under simplifying assumptions, one can derive an expression for an approximately optimal interval between checkpoints, $T_{opt}$. One common result is
$$T_{opt} \approx \sqrt{2 \cdot MTBF \cdot C},$$
where $T_{opt}$ is the ideal time between checkpoints, $MTBF$ is the mean time between failures, and $C$ is the time to write a checkpoint.
This formula should be treated as a starting point rather than a strict rule. In practice, users rarely know the true $MTBF$ of a large system, and both checkpoint time and failure rates may vary. Still, the idea that the optimal interval scales with the square root of the product of the mean time between failures and the checkpoint time is useful. Very long runs on less reliable systems should checkpoint relatively more often, especially when checkpoint I/O is fast.
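As a quick worked example of the guideline, assuming an $MTBF$ of 12 hours and a checkpoint write time of 2 minutes (both values are invented for illustration):

```python
import math

mtbf = 12 * 3600   # assumed mean time between failures: 12 hours, in seconds
c = 2 * 60         # assumed checkpoint write time: 2 minutes, in seconds

t_opt = math.sqrt(2 * mtbf * c)
print(f"T_opt ≈ {t_opt:.0f} s ≈ {t_opt / 60:.0f} min")  # ≈ 3220 s, roughly 54 minutes
```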
Users often combine this analytic guidance with empirical observation. By timing checkpoints and monitoring the typical stability of the system, one can set intervals that balance overhead and risk. For example, a job expected to run for ten hours on a system whose queues and policies are known might checkpoint every 30 to 60 minutes, adjusted after some experience.
Placement and content of checkpoints
Deciding where in the code to place checkpoint operations is as important as deciding when to call them. Good checkpoint locations have clear semantic boundaries, such as the end of a time step, a nonlinear iteration, or a major phase of a workflow. At such boundaries, the program state is usually well defined, and it is easy to reconstruct control flow upon restart.
Checkpoints must include enough information to resume without ambiguity. This typically includes iteration counters, physical time of the simulation, configuration parameters, and any random number generator state so that stochastic processes behave correctly after restart. For adaptive algorithms, the checkpoint must capture the current mesh or decomposition, and mappings between global and local indices.
At the same time, checkpoints should avoid storing unnecessary transient data that can be recomputed cheaply. For example, derived fields that are simple functions of primary variables can be reconstructed after restart instead of written. This reduces size and write time.
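The sketch below shows one way to capture the random number generator state with NumPy and to recompute a derived quantity after restart instead of storing it. The variable names and the kinetic energy example are illustrative.

```python
import pickle
import numpy as np

def checkpoint_state(path, step, sim_time, velocity, rng):
    # Store primary variables and the RNG state; derived quantities such as
    # kinetic energy are cheap to recompute and are deliberately not written.
    state = {"step": step, "sim_time": sim_time, "velocity": velocity,
             "rng_state": rng.bit_generator.state}
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restore_state(path, rng):
    with open(path, "rb") as f:
        state = pickle.load(f)
    rng.bit_generator.state = state["rng_state"]         # random streams resume exactly
    kinetic_energy = 0.5 * np.sum(state["velocity"] ** 2)  # recomputed, not stored
    return state["step"], state["sim_time"], state["velocity"], kinetic_energy

# rng = np.random.default_rng(12345)
# checkpoint_state("ckpt_000500.pkl", 500, 5.0, velocity, rng)
```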
Synchronous and asynchronous checkpointing
The most straightforward implementation is synchronous checkpointing, where the application stops computation, writes all checkpoint data, then resumes. In this scheme, computation and I/O do not overlap. Synchronous checkpointing is simple to reason about and implement but can lead to visible pauses, especially if each checkpoint takes many seconds to flush to disk.
Asynchronous checkpointing tries to overlap computation with I/O. The application copies checkpoint data into a buffer, then immediately continues computing while a separate thread, process, or I/O library transfers the buffered data to the filesystem. This hides part of the I/O time behind continued computation.
Asynchronous strategies are more complex because the program must ensure that data is not modified while it is still being written. Double buffering, careful use of threads, and nonblocking I/O APIs can help. In addition, the application must know at what point a checkpoint is fully and reliably stored so that it can be used for restart.
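A minimal sketch of this buffering idea using a Python background thread is shown below. The class name and the atomic-rename convention for marking a complete checkpoint are assumptions, not a standard API.

```python
import copy
import os
import pickle
import threading

class AsyncCheckpointer:
    """Snapshot the state into a buffer, then write it on a background thread."""

    def __init__(self):
        self._writer = None

    def checkpoint(self, path, state):
        self.wait()                      # ensure the previous write has finished
        snapshot = copy.deepcopy(state)  # computation may keep modifying `state`
        self._writer = threading.Thread(target=self._write, args=(path, snapshot))
        self._writer.start()

    def wait(self):
        if self._writer is not None:
            self._writer.join()          # only after this is the file safe to use

    @staticmethod
    def _write(path, snapshot):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, path)            # atomic rename marks the checkpoint as complete

# ckpt = AsyncCheckpointer()
# ckpt.checkpoint("ckpt_000200.pkl", {"step": 200, "grid": grid})
# ... keep computing while the write proceeds ...
# ckpt.wait()   # before exiting, or before relying on the file for a restart
```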
When implemented correctly, asynchronous checkpointing can significantly reduce perceived overhead, particularly in codes that have regular opportunities to prepare checkpoint data while other computation continues.
Local, shared, and hierarchical checkpoint storage
Where checkpoints are stored affects both performance and reliability. Several layers of storage often exist, from node local SSDs or RAM disks, to parallel filesystems, to off‑cluster archival systems.
Local checkpointing uses storage on the same node as the process. Access is typically fast and avoids contention with other jobs. However, if the entire node fails, local checkpoints may be lost. Local storage works well for short term resilience to process crashes or for intermediate checkpoints that can be periodically flushed to a more durable layer.
Shared checkpointing writes directly to a parallel filesystem that is available across nodes. This is the most common approach in production HPC runs, since restarts often occur on different nodes than the original run. The downside is that heavy concurrent I/O from many jobs can cause variability in performance, and very large, frequent checkpoints can stress the shared system.
Hierarchical checkpointing combines both. An application first writes quick checkpoints to node local storage, then a background process or postprocessing stage migrates important checkpoints to the parallel filesystem or even to archival storage. This reduces pressure on the shared system while keeping the ability to survive full node failures, although there is a window during which only local copies exist.
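A simple version of this pattern might write every checkpoint to node-local storage and copy only some of them onward, as in the sketch below. The directory paths and the `flush_every` parameter are hypothetical and site specific.

```python
import os
import shutil

LOCAL_DIR = "/local/scratch/ckpt"    # assumed node-local path (site specific)
SHARED_DIR = "/shared/project/ckpt"  # assumed parallel filesystem path (site specific)

def write_local_then_flush(filename, write_fn, step, flush_every=10):
    # Fast, frequent checkpoint to node-local storage...
    local_path = os.path.join(LOCAL_DIR, filename)
    write_fn(local_path)
    # ...and occasionally promote a copy to the durable shared filesystem, so a
    # full node failure loses at most `flush_every` local checkpoints.
    if step % flush_every == 0:
        shutil.copy2(local_path, os.path.join(SHARED_DIR, filename))
```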
Multi‑level checkpointing
Multi‑level checkpointing extends the idea of hierarchical storage to incorporate different kinds of protection against different types of failure. For example, one can design:
- Level 1 checkpoints: very frequent, stored in memory or on local SSD, protecting against transient process failures or soft errors.
- Level 2 checkpoints: less frequent, stored on the parallel filesystem, protecting against node failures or short system outages.
- Level 3 checkpoints: rare, stored on remote or archival systems, protecting against severe incidents such as filesystem corruption or catastrophic outages.
Each level has its own cost and frequency. The application or runtime can adaptively choose which level to use based on expected failure modes and job length. For instance, for a job that runs a few hours during normal operation, only levels 1 and 2 might be used. For month long campaigns, level 3 archives might also be included.
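One simple way to schedule the levels is by nested counters, as in the sketch below; the interval values are placeholders, and real multi-level runtimes use richer policies.

```python
def checkpoint_level(step, l1_every=10, l2_every=100, l3_every=1000):
    # Pick the most durable level that is due at this step; higher levels are
    # rarer because they are more expensive to write.
    if step % l3_every == 0:
        return 3   # remote or archival copy
    if step % l2_every == 0:
        return 2   # parallel filesystem
    if step % l1_every == 0:
        return 1   # node-local SSD or memory
    return 0       # no checkpoint this step

# Steps 10, 20, ... trigger level 1; 100, 200, ... level 2; 1000, 2000, ... level 3.
```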
Multi‑level strategies are particularly attractive at very large scale, where the probability of some failure during a long run is high, but writing all checkpoints to the most durable and expensive storage would be prohibitive.
Partial and selective checkpointing
Many codes run on heterogeneous resources such as CPUs plus GPUs, or have components that are more critical than others. Partial or selective checkpointing focuses on a subset of the application state.
In GPU accelerated applications, reconstructing GPU data from CPU data may be cheaper than writing GPU device memory directly. In such cases, checkpoints might include only host side arrays, even though some state resides on accelerators at run time. Alternatively, for codes that maintain large caches or derived data, it may be cheaper to reconstruct those from a smaller primary state, rather than store everything.
Selective checkpointing can also be used for multi‑component workflows. For example, a coupled simulation may consist of fluid dynamics, structural mechanics, and a postprocessing stage. If one component is deterministic and cheap, it may not need its own checkpoints, as long as its input from other components is recorded. This reduces total I/O volume.
The main risk of partial approaches is underestimating what is needed for a consistent restart. Careful analysis of dependencies is required to ensure that all essential data and control state are included.
Compressed and reduced precision checkpoints
Storage and I/O bandwidth can be conserved by reducing the size of checkpoint files through compression or reduced precision representations.
Compression can be lossless or lossy. Lossless compression preserves data exactly and is safer when bitwise reproducibility is essential. For smooth fields or structured data, lossless methods can still yield significant size reduction. However, compression itself costs CPU time, so there is a trade‑off between shrinking the checkpoint and spending extra time compressing and decompressing.
Lossy compression allows approximate storage of the state. If an application is tolerant to small perturbations, it may be acceptable to store a coarser representation of certain variables. This can greatly reduce sizes and I/O time. Restarted simulations will not be bitwise identical but can remain scientifically valid.
Reduced precision is a special kind of lossy strategy. For instance, double precision arrays can be stored as single precision in checkpoints, or certain less critical fields can be quantized. The effect is similar to compression and may simplify implementation when hardware and file formats already support multiple precisions.
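A minimal sketch combining container-level compression with reduced precision, using NumPy's compressed archive format; which fields can safely be demoted to single precision is application specific and assumed here purely for illustration.

```python
import numpy as np

def write_reduced_checkpoint(path, step, temperature, velocity):
    # Container-level (lossless) compression plus reduced precision for fields
    # that tolerate it; integer counters stay exact.
    np.savez_compressed(path,
                        step=np.int64(step),
                        temperature=temperature.astype(np.float32),  # lossy: f64 -> f32
                        velocity=velocity.astype(np.float32))

# On restart the arrays come back as float32 and can be promoted to float64 for
# the ongoing computation; the run will not be bitwise identical to an
# uninterrupted one.
```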
In all such approaches, it is important to validate that the numerical behavior after restart remains acceptable, especially where long time integration or chaotic dynamics may amplify small differences.
User‑driven vs automatic checkpointing in workflows
In complex HPC workflows that consist of many stages, tasks, or ensemble runs, checkpointing strategies can either be implemented inside each application or orchestrated by an external workflow system.
User‑driven checkpointing places control in the hands of the application developer or user. The code decides when to write state and in what format. This provides fine grained control and often yields the most efficient solution for a specific algorithm.
Automatic workflow‑level checkpointing treats entire tasks or stages as units of work. The workflow manager records which tasks have completed and which intermediate outputs exist. A failure then triggers rerun of only the missing tasks, using existing outputs as implicit checkpoints. This is common for parameter sweeps, ensembles, or data processing pipelines, where individual tasks may be reexecuted cheaply.
In practice, many large projects combine both levels. Simulation codes implement internal restart capabilities, and workflow tools use these capabilities to manage long campaigns across many jobs and queues. Checkpointing strategies should therefore be designed with both individual runs and higher level orchestration in mind.
Practical considerations and common pitfalls
Several practical aspects strongly influence whether a checkpointing strategy works well on real clusters.
First, contention on shared filesystems can change over time. A strategy that appears efficient during off‑peak hours may perform poorly during busy periods. Adding some randomness or jitter to checkpoint timings can reduce synchronized bursts of I/O across many jobs.
Second, many job schedulers impose wall time limits. A safe strategy is to write a final checkpoint sufficiently before the wall time expires. For example, if checkpoints take approximately $C$ seconds, and the job end is at time $T_{end}$, the last checkpoint should be initiated at least $C$ plus some margin before $T_{end}$.
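A small sketch of this guard is shown below, with the wall time, checkpoint cost, and margin hard-coded as assumed values; a real job would read these from the scheduler or a configuration file.

```python
import time

JOB_START = time.time()
WALLTIME = 10 * 3600     # assumed 10-hour wall time limit, in seconds
CHECKPOINT_COST = 300    # measured or estimated checkpoint write time C, in seconds
MARGIN = 120             # extra safety margin, in seconds

def must_write_final_checkpoint():
    remaining = WALLTIME - (time.time() - JOB_START)
    return remaining < CHECKPOINT_COST + MARGIN

# In the main loop: once this returns True, write a checkpoint and exit cleanly
# instead of letting the scheduler kill the job mid-write.
```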
Third, file management becomes challenging when hundreds of checkpoints accumulate. It is often beneficial to keep only a small number of recent checkpoints, for example by periodically deleting older ones, as long as there is no need to roll back very far. Naming schemes that encode simulation time, iteration number, and checkpoint level can help with organization and automated cleanup.
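A possible cleanup helper that keeps only the most recent checkpoints, assuming a file naming pattern that encodes the step number (the pattern shown is illustrative):

```python
import os
import re

def prune_checkpoints(directory, keep=3, pattern=r"ckpt_step(\d+)\.pkl"):
    # Keep only the `keep` most recent checkpoints, identified by the step
    # number encoded in the file name.
    files = [(int(m.group(1)), name) for name in os.listdir(directory)
             if (m := re.fullmatch(pattern, name))]
    for _, name in sorted(files)[:-keep]:
        os.remove(os.path.join(directory, name))

# prune_checkpoints("/shared/project/ckpt", keep=3)
```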
Finally, checkpointing strategies should be tested explicitly. Restarting from several checkpoint ages during development helps catch missing state, file format errors, and subtle logic bugs before large production runs depend on them.
Summary
Checkpointing is a key ingredient of robust and efficient HPC workflows. Full, incremental, and differential techniques, combined with coordinated or more advanced uncoordinated approaches, offer a menu of options to handle failures and long jobs. Tuning checkpoint frequency, choosing storage layers, and deciding which data to store can dramatically affect performance and reliability. By designing checkpointing strategies that reflect the characteristics of both the application and the underlying HPC system, users can reduce wasted computation, meet wall time constraints, and run larger and more ambitious simulations with confidence.