Finding the failure rate of a system is a crucial step in high-performance computing systems analysis. To deal with this problem, a fault-tolerant mechanism called the checkpoint/restart technique was introduced. However, this mechanism incurs additional costs. We therefore propose two models for different schemes (full and incremental checkpointing). The models, which are based on the reliability of the system, are used to determine checkpoint placements. Both proposed models balance the checkpoint overhead against the re-computation time. Because each incremental checkpoint adds extra cost during the recovery period, a method to find the number of incremental checkpoints between two consecutive full checkpo...
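The abstract's own reliability-based placement models are not reproduced here. As a stand-in illustration of the checkpoint-overhead versus re-computation trade-off it describes, the following sketch uses the classic first-order Young approximation, which picks the checkpoint interval minimizing expected lost work for a given checkpoint cost and system MTBF (both values below are assumed, not taken from the paper):

```python
import math

def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order Young approximation of the optimal checkpoint interval.

    checkpoint_cost: time to write one full checkpoint (seconds)
    mtbf: system mean time between failures (seconds)

    A short interval wastes time writing checkpoints; a long interval
    wastes time re-computing after a failure. The square-root formula
    balances the two to first order.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Hypothetical numbers: 60 s per checkpoint, 24 h system MTBF
tau = young_interval(60.0, 24 * 3600.0)  # about 3220 s between checkpoints
```

Incremental schemes change only the effective `checkpoint_cost` (and add per-increment recovery cost), which is why the paper needs a separate model to choose how many incremental checkpoints to place between two full ones.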
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
Next-generation exascale systems, those capable of performing a quintillion operations per second, ...
Parallel computing systems provide hardware redundancy that helps to achieve low-cost fault-toleran...
Computational power demand for large challenging problems has increasingly driven the physical size ...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
The HPC community projects that future extreme-scale systems will be much less stable than curr...
Large-scale applications running on new computing platforms with thousands o...
Over the last decade, computing systems have turned to large-scale parallel platforms composed of thousand...
Long-running applications are often subject to failures. Once failures occur, they lead to unacce...
The traditional single-level checkpointing method suffers from significant ov...
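The multilevel idea alluded to above is to interleave cheap node-local checkpoints (which survive only soft failures) with expensive global ones written to the parallel file system. A minimal sketch of such an interleaved schedule, with the every-k-th-global policy as an assumption for illustration rather than any specific paper's algorithm:

```python
def two_level_schedule(num_checkpoints: int, global_every: int) -> list[str]:
    """Label each checkpoint slot as 'local' (fast, node-level storage)
    or 'global' (slow, parallel file system), promoting every
    `global_every`-th checkpoint to the global level."""
    return ["global" if (i + 1) % global_every == 0 else "local"
            for i in range(num_checkpoints)]

sched = two_level_schedule(6, 3)
# ['local', 'local', 'global', 'local', 'local', 'global']
```

Recovery then restores from the most recent checkpoint whose level matches the failure severity, so the single-level overhead is paid only once every `global_every` checkpoints.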
In this paper, we present a unified model for several well-known checkpoint/re...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...