Design, Modeling, and Evaluation of a Scalable Multilevel Checkpointing System

DOI: 10.1109/SC.2010.18
Authors: Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R.

Citations: 5
High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. Multi-level checkpointing potentially solves this problem by using multiple types of checkpoints, with different costs and different levels of resiliency, in a single run. This solution employs lightweight checkpoints to handle the most common failure modes and relies on more expensive checkpoints for less common but more severe failures. This theoretically promising approach has not been fully evaluated in a large-scale, production system context. We have designed the Scalable Checkpoint/Restart (SCR) library, a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the load on the parallel file system by a factor of two on current and future systems.
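The idea of mixing cheap and expensive checkpoints in one run can be illustrated with a minimal sketch. This is not the SCR API and not the paper's Markov model: the names `Level`, `LEVELS`, `level_for`, and `total_checkpoint_cost`, the two-level layout, the write costs, and the periods are all hypothetical values chosen for illustration. The 500x cost ratio is only picked to fall inside the 100x-1000x range quoted in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str       # hypothetical label for the storage target
    cost_s: float   # assumed seconds to write one checkpoint at this level
    period: int     # this level is written on every `period`-th checkpoint


# Ordered from cheapest / least resilient to most expensive / most resilient.
# All numbers are illustrative assumptions, not measurements from the paper.
LEVELS = [
    Level("node-local", cost_s=1.0, period=1),
    Level("parallel-file-system", cost_s=500.0, period=10),
]


def level_for(checkpoint_index: int, levels=LEVELS) -> Level:
    """Return the most resilient level whose period divides the index.

    Indices start at 1. The first level has period 1, so it always matches.
    """
    chosen = levels[0]
    for level in levels:
        if checkpoint_index % level.period == 0:
            chosen = level
    return chosen


def total_checkpoint_cost(n_checkpoints: int, levels=LEVELS) -> float:
    """Sum the assumed write cost over checkpoints 1..n_checkpoints."""
    return sum(level_for(i, levels).cost_s for i in range(1, n_checkpoints + 1))


if __name__ == "__main__":
    multi = total_checkpoint_cost(10)
    single = total_checkpoint_cost(
        10, [Level("parallel-file-system", cost_s=500.0, period=1)]
    )
    print(f"multi-level: {multi} s, file-system only: {single} s")
```

Under these assumed numbers, ten checkpoints cost 509 s with the two-level schedule and 5000 s when every checkpoint goes to the parallel file system. The sketch ignores failure rates and recovery cost, which the paper's model accounts for.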
Conference: Supercomputing Conference - SC, pp. 1-11, 2010
    • ...Multi-level checkpointing [7]–[9] allows applications to store lower-overhead, less-resilient checkpoints to stable storage and write the slowest but most resilient checkpoints to the parallel file system...

    Hui Liu et al. Algorithm-Based Recovery for Newton's Method without Checkpointing

    • ...Scalable Checkpointing/Restart (SCR) library was introduced recently to support diskless checkpointing at application level [14]...

    Hui Jin et al. REMEM: REmote MEMory as Checkpointing Storage

    • ...Optimizations of coordinated CR such as incremental [5], [6], non-blocking [7], diskless [8], RAID-inspired distributed and multi-level checkpointing [9] ameliorate some of its negative effects...
    • ...One example is a memory-distributed mechanism that saves checkpoint state redundantly across a distributed system in a RAID-like manner, and writes it to stable storage only when a failure occurs [9]...

    Maria Ruiz Varela et al. Fault-tolerance for exascale systems
