Failure-free execution will become rare in future exascale computers; fault tolerance is therefore an active field of research. In this paper, we study the impact of decomposing an application into much more parallelism than the physical parallelism on the rollback step of fault-tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance the workload after a failure, without the need for spare nodes, while preserving performance. We show that the overhead on normal execution remains low for relevant factors of over-decomposition. With over-decomposition, restarted execution on the remaining nodes after failures shows very good performance compared to the classic decomposition approach: ...
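The benefit of over-decomposition described above can be sketched with a few lines of Python. This is a minimal illustration, not the paper's runtime: the node count `P`, the over-decomposition `factor`, and the round-robin `assign` helper are all assumptions made for the example.

```python
# Illustrative sketch of over-decomposition for post-failure load balancing.
# With factor > 1 the application is split into more tasks than nodes, so
# after a failure the survivors can absorb the lost node's tasks evenly.

def assign(tasks, nodes):
    """Round-robin the task list over the given nodes; return tasks per node."""
    load = {n: 0 for n in nodes}
    for i, _ in enumerate(tasks):
        load[nodes[i % len(nodes)]] += 1
    return load

P = 8                              # physical nodes (assumed for the example)
factor = 4                         # over-decomposition factor
tasks = list(range(P * factor))    # 32 tasks instead of 8

nodes = list(range(P))
before = assign(tasks, nodes)      # balanced: 4 tasks per node

nodes.remove(3)                    # one node fails; restart on the 7 survivors
after = assign(tasks, nodes)       # survivors get 4 or 5 tasks each

# Imbalance after failure: max load / min load across surviving nodes.
print(max(after.values()) / min(after.values()))
```

With `factor = 1` (classic decomposition), one survivor would inherit the failed node's entire task and carry double the load; with `factor = 4` the worst-case imbalance among survivors drops to 5/4, which is the effect the abstract attributes to over-decomposition.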
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in...
We consider the problem of bringing a distributed system to a consistent state after transient fail...
Fault-tolerance protocols play an important role in today's long-running scienti...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
Processor failures in post-petascale parallel computing platforms are common o...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
Large-scale applications running on new computing platforms with thousands o...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables ...
InteGrade is a grid middleware infrastructure that enables the use of idle computing power from user...
Petaflops systems will have tens to hundreds of thousands of compute nodes which increases the likel...
High performance computing applications must be resilient to faults. The tradi...
Checkpointing is a well-known mechanism to achieve fault tolerance. In distributed applications...