Communicated by Akihiro Fujiwara.
Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé [23] with the non-blocking algorithm of Ni, Meneses and Kalé [15] in terms of both performance and risk. We also extend the model proposed in [23, 15] to assess the impact of the overhead associated with non-blocking communications. In addition, we deal with arbitrary failure distributions (as opposed to uniform distributions in [23]). We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work without additional memory, ...
Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. ...
In this paper, we present a unified model for several well-known checkpoint/re...
Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the red...
A long-term trend in high-performance computing is the increasing number of no...
There are two approaches to reduce the overhead associated with coordinated checkpointing: f...
As the capability and component count of systems increase, the MTBF decreases. Typically, a...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
This paper shows how Koo and Toueg's distributed checkpointing algorithm can be modified so as to...
Cooperative checkpointing uses global knowledge of the state and health of the machine to improve pe...
Nowadays, clusters and grids are made of more and more computing nodes. The programming o...
This paper examines the performance of synchronous checkpointing in a distributed computing environm...
In this paper, we design and analyze strategies to replicate the execution of ...