International audience. Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the involved mechanisms, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution, where the data is protected by its own intrinsic properties, and can therefore be algorithmically recomputed without the need of checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave secti...
Fault-tolerance protocols play an important role in today's long runtime scienti...
Parallel execution time is expected to decrease as the number of processors in...
Large scale applications running on new computing platforms with thousands o...
Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalabil...
...approaches promise unparalleled scalability and performance in failure-prone environments. With the ad...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
This report provides an introduction to the design of scheduling algorithms to cope with faults on l...
High performance computing applications must be resilient to faults. The tradi...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
With increasing scale and complexity of supercomputing and cloud computing arc...
The advent of cluster computing has resulted in a thrust towards providing software mechanisms for r...
In this paper, we design and analyze strategies to replicate the execution of ...