The massive scale of current and next-generation massively parallel processing (MPP) systems presents significant challenges for fault tolerance. In particular, the standard approach to fault tolerance, application-directed checkpointing, puts a heavy strain on the storage system and the interconnection network. The resulting overheads on the application severely impact performance and scalability. Checkpoint overhead can be reduced by decreasing the checkpoint latency, which is the time to write a checkpoint file, or by increasing the checkpoint interval, which is the compute time between writing checkpoint files. However, increasing the checkpoint interval may increase execution time in the presence of failures, because more computation is lost and must be redone after each failure. The...
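The trade-off between checkpoint latency and checkpoint interval can be illustrated with a minimal sketch. It is not taken from the paper: the function names are hypothetical, the overhead model assumes a failure-free run with blocking checkpoints, and the interval estimate uses Young's well-known first-order approximation, which assumes the latency is small relative to the mean time between failures (MTBF).

```python
import math


def checkpoint_overhead(latency_s: float, interval_s: float) -> float:
    """Fraction of failure-free wall-clock time spent writing checkpoints.

    Assumption (not from the source): each cycle consists of interval_s
    seconds of computation followed by latency_s seconds of blocking I/O.
    """
    return latency_s / (interval_s + latency_s)


def young_interval(latency_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Assumption (not from the source): failures arrive with mean time
    between failures mtbf_s, and latency_s is much smaller than mtbf_s.
    """
    return math.sqrt(2.0 * latency_s * mtbf_s)


if __name__ == "__main__":
    # Doubling the compute time per cycle lowers the failure-free overhead...
    print(checkpoint_overhead(latency_s=60.0, interval_s=540.0))
    print(checkpoint_overhead(latency_s=60.0, interval_s=1140.0))
    # ...but failures bound how long the interval should be.
    print(young_interval(latency_s=50.0, mtbf_s=10_000.0))
```

Under these assumptions, a longer interval always lowers the failure-free overhead, while the MTBF caps the interval that is worthwhile, which is the tension the abstract describes.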
This short paper deals with parallel scientific applications using non-blocking and periodic coordin...
In this paper, we design and analyze strategies to replicate the execution of ...
This report provides an introduction to the design of scheduling algorithms to cope with faults on l...
The large scale of current and next-generation massively parallel processing (MPP) systems presents ...
Checkpointing and rollback is a technique to minimize the loss of computation in the presence of fai...
Researchers have mentioned that the three most difficult and growing problems in the future of high-...
This short paper deals with parallel scientific applications using non-blocking and periodic co-ordi...
As computational clusters rapidly grow in both size and complexity, system reliability and, in parti...
Parallel computing systems provide hardware redundancy that helps to achieve low-cost fault-toleran...
HPC community projects that future extreme scale systems will be much less stable than curr...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures th...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...