This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE) that relies on error detection techniques already available in commodity hardware. Detection operates at the memory page level, which enables the use of simple algorithmic redundancies to correct errors. Such redundancies would be inapplicable under coarse-grained error detection, but become very powerful when the hardware can detect errors precisely. Relations extracted straightforwardly from the solver make it possible to recover lost data exactly. This method is free of the overheads of backward recovery techniques such as checkpointing, and does not compromise the mathematical convergence properties of the solver as restarting would. We apply this recovery t...
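As an illustration of how a relation taken from the solver can restore lost data exactly, the following minimal sketch rebuilds a lost block of the residual. It assumes the solver maintains the invariant r = b - A x, as conjugate gradient does; the abstract does not state which relations the authors use, so this invariant, the function names, the 4x4 matrix, and the two-entry "page" are all hypothetical choices made for illustration.

```python
# Sketch only: rebuild a lost block of the residual from r = b - A x.
# Assumptions (not from the source): the invariant r = b - A x holds,
# and the matrix, vectors, and "page" size below are made-up test data.

def matvec_rows(A, x, rows):
    """Return (A x) restricted to the given row indices."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in rows]

def recover_residual_block(A, b, x, r, lost_rows):
    """Overwrite the lost entries of r in place with b_i - (A x)_i."""
    Ax = matvec_rows(A, x, lost_rows)
    for i, ax_i in zip(lost_rows, Ax):
        r[i] = b[i] - ax_i
    return r

A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = [0.5, 0.25, 0.5, 0.75]  # current iterate, assumed intact

# Reference residual computed before the simulated error.
r_true = [b_i - ax_i for b_i, ax_i in zip(b, matvec_rows(A, x, range(4)))]

r = list(r_true)
lost_rows = [2, 3]          # entries on the page flagged by the hardware
for i in lost_rows:
    r[i] = float("nan")     # simulate a detected, uncorrected error

recover_residual_block(A, b, x, r, lost_rows)
```

Only the rows that lie on the damaged page are recomputed, which is why page-level detection matters here: the repair costs a partial matrix-vector product rather than a rollback or a restart.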
Abstract. Resilience is a major challenge for large-scale systems. It is particularly important for ...
International audience: The advent of extreme scale machines will require the use of parallel resour...
The coming exascale era is a great opportunity for high performance computing (HPC) applications. Ho...
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE)...
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for a...
This report presents a method to recover from faults detected by hardware in numerical iterative sol...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
International audience: Several recent papers have introduced a periodic verification mechanism to det...
International audience: As the computational power of high performance computing (HPC) systems continu...
As we stride toward the exascale era, due to increasing complexity of supercomputers, hard and soft ...
Emerging high-performance computing platforms, with large component counts and lower power margins, ...
Several recent papers have introduced a periodic verification mechanism to detect silent errors i...
Several recovery techniques for parallel iterative methods are presented. First, the implementation ...
Large scale simulations are used in a variety of application areas in science and engineering to hel...