Long-running applications are often subject to failures, and each failure can incur unacceptable system overhead. Checkpointing is used to reduce the work lost when a failure occurs. In the two-level checkpoint recovery scheme used for long-running tasks, the system must periodically transfer a large memory context to remote stable storage. The overhead of setting checkpoints and the re-computation time therefore become critical issues that directly affect total system overhead. Motivated by these concerns, this paper presents a new model that introduces i-checkpoints into the existing two-level checkpoint recovery scheme to deal with the more probable failures at a smaller cost and ...
Due to the character of the original source materials and the nature of batch digitization, quality ...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
Abstract: It is important to design computer systems to tolerate some failures. This paper proposes tw...
Most distributed and multiprocessor recovery schemes proposed in the literature are designed to tole...
(a) The relationship between the total overheads of setting checkpoints and the number of failure...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
(a) The relationship between the total overheads of setting checkpoints and the completion time; ...
Abstract: Finding the failure rate of a system is a crucial step in high performance computing syst...
(a) u = O_i/O_m = 10%; (b) u = O_i/O_m = 12...
(a) u = O_i/O_m = 10%; (b) u = O_i/O_m = 30...
Large scale applications running on new computing platforms with thousands o...
Performance evaluation of checkpoint rollback recovery strategies for distributed systems is a field...
Next-generation exascale systems, those capable of performing a quintillion (10^18) operations ...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...