Abstract: Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (fraction of all errors that are actually detected). The main contribution of this paper is to show which detector(s) to use, and to characterize the optimal computational pattern for the application: how many detectors of each type to use, together with the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to the optimal for a re...
In this paper, we combine the traditional checkpointing and rollback recovery ...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
Errors have become a critical problem for high performance computing. Checkpoi...
Resilient algorithms in high-performance computing are subject to rigorous non-functional constrain...
This chapter describes a unified framework for the detection and correction of...
Silent errors, or silent data corruptions, constitute a major threat on very l...
As high-performance computing (HPC) continues to progress, constraints on HPC system design force t...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
Resilience has become a critical problem for high performance computing. Checkpointing protocols ar...
Hardware errors are on the rise with reducing chip sizes, and power constraints have necessitated th...
ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors. (Under the direc...