Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167–176] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every d iterations, and to checkpoint every c × d iterations. When a silent error is detected by the verification mechanism, one can roll back to the last checkpoint and re-execute from there. In this paper, we also propose to combine checkpointing and verification, but we use algorithm-based fault tolerance (ABFT) rather than stability tests. ABFT can be used for error detection, but also for erro...
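The verify-every-d, checkpoint-every-c × d scheme described in this abstract can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not the implementation from Chen's paper or from this one: `run_protected`, `make_step`, and `verify` are hypothetical names, the "solver" is a toy counter, and the check is a simple redundancy comparison standing in for a stability test or an ABFT checksum.

```python
import copy


def run_protected(step, verify, state, n_iters, d, c):
    """Run n_iters iterations of `step`, verifying every d iterations and
    checkpointing every c*d iterations. On a failed verification, roll back
    to the last checkpoint and re-execute from there.

    All names here are hypothetical; this is an illustrative sketch only.
    """
    checkpoint = (0, copy.deepcopy(state))  # initial state is the first checkpoint
    rollbacks = 0
    i = 0
    while i < n_iters:
        state = step(state, i)
        i += 1
        # Verify at every d-th iteration, and once more at the very end.
        if i % d == 0 or i == n_iters:
            if not verify(state):
                # Silent error detected: restore the last verified checkpoint.
                i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
                rollbacks += 1
                continue
            # Only a verified state is ever checkpointed.
            if i % (c * d) == 0:
                checkpoint = (i, copy.deepcopy(state))
    return state, rollbacks


def make_step(fault_at=None):
    """Toy iteration: advance a value and a redundant copy of it.
    Optionally corrupt the value once, at iteration index `fault_at`."""
    fired = {"done": False}

    def step(state, i):
        value, check = state
        value += 1
        check += 1
        if i == fault_at and not fired["done"]:
            value += 1000  # injected silent corruption (assumption for the demo)
            fired["done"] = True
        return (value, check)

    return step


def verify(state):
    """Stand-in detector: the value must agree with its redundant copy."""
    return state[0] == state[1]


if __name__ == "__main__":
    final_state, n_rollbacks = run_protected(
        make_step(fault_at=7), verify, (0, 0), n_iters=20, d=2, c=3
    )
    print(final_state, n_rollbacks)
```

With d = 2 and c = 3, the injected fault is caught at the next verification point and the loop resumes from the checkpoint taken at iteration 6, so the run still ends in the correct state after one rollback. Smaller d shortens the detection latency at the price of more frequent checks, while larger c reduces checkpoint overhead but lengthens the re-execution after a rollback.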
As hardware devices like processor cores and memory sub-systems based on nano-scale technol...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
Several recent papers have introduced a periodic verification mechanism to det...
Several recent papers have introduced a periodic verification mechanism to detect silent errors i...
Several recent papers have introduced a periodic verification mechanism to detect silent errors in it...
This report describes a unified framework for the detection and correction of silent errors, which co...
Errors have become a critical problem for high performance computing. Checkpoi...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
The Preconditioned Conjugate Gradient method is often used in numerical simulations. While being wid...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficien...
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficien...
In this paper, we revisit traditional checkpointing and rollback recovery stra...
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE)...