The increasing number of cores on current supercomputers will quickly decrease the system's mean time to failure (MTTF). With such high failure rates, long-running applications have little chance of completing successfully unless they use a fault tolerance strategy. Double in-memory/disk checkpointing is a production fault tolerance strategy in the Charm++ runtime system. Each node stores one copy of its checkpoint in its own memory or disk as a local checkpoint, and another copy in another node's memory or disk as a global checkpoint. This method takes advantage of the relatively high network bandwidth compared to I/O bandwidth, so it can store a checkpoint faster than traditional NFS-based checkpoint/restart. How...
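The double checkpointing idea described in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the Charm++ implementation: the `Node` class, the helper names, and the ring-style buddy assignment `(rank + 1) % n` are all hypothetical choices made for illustration, and network transfer is modelled as a plain deep copy.

```python
import copy


class Node:
    """Hypothetical stand-in for one compute node (not a Charm++ type)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0, "data": [rank]}
        self.local_ckpt = None   # copy of this node's own checkpoint
        self.buddy_ckpts = {}    # rank -> checkpoint held on behalf of another node


def buddy_of(rank, n):
    # Assumption: buddies form a ring; the source does not say how buddies are chosen.
    return (rank + 1) % n


def checkpoint(nodes):
    """Store each node's state twice: locally and in its buddy's memory."""
    n = len(nodes)
    for node in nodes:
        snapshot = copy.deepcopy(node.state)
        node.local_ckpt = snapshot
        # The deep copy models sending the checkpoint over the network.
        nodes[buddy_of(node.rank, n)].buddy_ckpts[node.rank] = copy.deepcopy(snapshot)


def fail(nodes, rank):
    """Replace a node with a blank one: its state and every copy it held are lost."""
    nodes[rank] = Node(rank)
    nodes[rank].state = None


def restart(nodes):
    """Roll every node back to the last checkpoint (assumes one was taken)."""
    n = len(nodes)
    for node in nodes:
        if node.local_ckpt is not None:
            # Surviving node: restore from its local copy.
            node.state = copy.deepcopy(node.local_ckpt)
        else:
            # Replacement node: fetch the copy that its buddy holds.
            holder = nodes[buddy_of(node.rank, n)]
            node.state = copy.deepcopy(holder.buddy_ckpts[node.rank])
            node.local_ckpt = copy.deepcopy(node.state)
    # Re-replicate so that every checkpoint again exists in two places.
    for node in nodes:
        nodes[buddy_of(node.rank, n)].buddy_ckpts[node.rank] = copy.deepcopy(node.local_ckpt)


if __name__ == "__main__":
    cluster = [Node(r) for r in range(4)]
    for nd in cluster:
        nd.state["step"] = 5
    checkpoint(cluster)
    fail(cluster, 2)
    restart(cluster)
    print([nd.state["step"] for nd in cluster])
```

In this sketch a single failure is always recoverable, because the failed node's checkpoint survives on its buddy. If a node and its buddy fail together, both copies are lost, and a single-node run has no buddy at all.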
To make progress in the face of failures, long-running parallel applications need to save their ...
As machines increase in scale, it is predicted that failure rates of supercomputers will correspondi...
Abstract: Checkpointing is a procedure of storing process state to a file, which is later used to re...
Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the ...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
Parallel scientific applications deal with machine unreliability by periodic checkpointing, in which...
By leveraging the enormous amount of computational capabilities, scientists today are being able to ...
This article proposes an original approach that applies the Rollback-Dependency Trackability (RDT) p...
Abstract—As the capability and component count of systems increase, the MTBF decreases. Typically, a...
As we move to large manycores, the hardware-based global checkpointing schemes that have been propo...
Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the red...
Abstract — Checkpointing is a typical approach to tolerate failures in today’s supercomputing cluste...
Checkpoint is defined as a designated place in a program at which normal processing is interrupted s...