International audience
Algorithm-Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. With advances in the theoretical and practical understanding of the algorithmic traits that enable such approaches, a growing number of frequently used algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms define a phase of the execution during which the data is protected by its own intrinsic properties and can be algorithmically recomputed without the need for checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT...
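As a rough illustration of what "protected by its own intrinsic properties" can mean, the sketch below uses a classic checksum-row encoding for matrix multiplication. It is not taken from the paper: the helper names (`matmul`, `with_checksum_row`, `recover_row`) and the tiny integer matrices are assumptions chosen for illustration only. Appending a row of column sums to the left operand makes the product carry its own checksum row, so a single lost row of the result can be rebuilt algebraically instead of being restored from a checkpoint.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_checksum_row(a):
    """Return a copy of `a` with one extra row holding its column sums."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def recover_row(c_full, lost):
    """Rebuild data row `lost` from the checksum row and the surviving rows.

    `c_full` is a product whose last row is the checksum row; exactly one
    data row is assumed to be lost (hypothetical single-failure scenario).
    """
    n = len(c_full) - 1                      # number of data rows
    checksum = c_full[n]
    survivors = [c_full[i] for i in range(n) if i != lost]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]


a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]

# The checksum row of the encoded operand propagates through the product.
c_full = matmul(with_checksum_row(a), b)
expected = matmul(a, b)

# Simulate the loss of data row 1, then rebuild it without any checkpoint.
damaged = [row[:] for row in c_full]
damaged[1] = [0, 0]
restored = recover_row(damaged, 1)
```

Integer inputs keep the recovery exact; with floating-point data the rebuilt row would match only up to rounding error.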
Processor failures in post-petascale parallel computing platforms are common o...
With increasing scale and complexity of supercomputing and cloud computing arc...
Fast checkpointing algorithms require distributed access to stable storage. Th...
In this paper, we present a unified model for several well-known checkpoint/re...
In this paper, we design and analyze strategies to replicate the execution of ...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
By leveraging the enormous amount of computational capabilities, scientists today are being able to ...
This paper revisits replication coupled with checkpointing for fail-stop error...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...