Abstract: Finding the failure rate of a system is a crucial step in high performance computing systems analysis. To deal with this problem, a fault tolerant mechanism, called the checkpoint/restart technique, was introduced. However, performing this mechanism incurs additional costs. Thus, we propose two models for different schemes (full and incremental checkpoint schemes). The models, which are based on the reliability of the system, are used to determine the checkpoint placements. Both proposed models consider a balance between checkpoint overhead and the re-computing time. Due to the extra costs from each incremental checkpoint during the recovery period, a method to find the number of incremental checkpoints between two consecutive...
Checkpoints are widely used to improve the performance of computer systems and programs in the prese...
Researchers have mentioned that the three most difficult and growing problems in the future of high-...
Long-running applications are often subject to failures. Once failures occur, they lead to unacce...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...
Large scale applications running on new computing platforms with thousands o...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
Over the last decade, computing systems have turned to large-scale parallel platforms composed of thousand...
We study programs which operate in the presence of possible failures and which must be restarted fro...
It is important to design computer systems to tolerate some failures. This paper proposes tw...
Next-generation exascale systems, those capable of performing a quintillion operations per second, ...