This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE), relying on error detection techniques already available in commodity hardware. Detection operates at the memory page level, which enables the use of simple algorithmic redundancies to correct errors. Such redundancies would be inapplicable under coarse-grained error detection, but become very powerful when the hardware is able to detect errors precisely. Relations straightforwardly extracted from the solver make it possible to recover lost data exactly. This method is free of the overheads of backward recovery techniques such as checkpointing, and does not compromise the mathematical convergence properties of the solver, as restarting would. We apply this recovery to...
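As an illustration of how a relation extracted from a solver can restore a lost memory page, the sketch below uses the residual identity r = b - A x, which many iterative solvers maintain. This is an assumed example: the abstract does not state which relations the paper uses, and the function names, the dense list-of-lists matrix, and the notion of a "page" as a list of lost indices are all hypothetical. Recovery of the iterate additionally assumes that the diagonal block of A restricted to the lost indices is nonsingular, which holds when A is symmetric positive definite.

```python
# Illustrative sketch only. The relation r = b - A x is an assumed example of
# a solver invariant; all names and the "page" model here are hypothetical.

def residual(A, x, b):
    """Return r = b - A x for a dense matrix stored as a list of rows."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]

def solve_dense(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(col + 1, n):
            factor = aug[k][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[k][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def recover_residual_page(A, x, b, lost):
    """Lost entries of r follow directly from r_i = b_i - (A x)_i."""
    return {i: b[i] - sum(A[i][j] * x[j] for j in range(len(x))) for i in lost}

def recover_iterate_page(A, x, r, b, lost):
    """Lost entries of x solve A[lost, lost] x[lost] = b - r - A[lost, kept] x[kept].

    Only rows in `lost` and surviving entries of x are read, so the damaged
    entries of x may hold arbitrary values.
    """
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    rhs = [b[i] - r[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    block = [[A[i][j] for j in lost] for i in lost]
    return dict(zip(lost, solve_dense(block, rhs)))

# Small symmetric positive definite example with entries 1 and 2 of x "lost".
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = [0.5, -1.0, 2.0, 0.25]
r = residual(A, x, b)

damaged = list(x)
damaged[1] = damaged[2] = float("nan")  # simulate a page flagged by hardware
recovered = recover_iterate_page(A, damaged, r, b, lost=[1, 2])
```

In this sketch the recovered values match the lost ones up to floating-point rounding, and the cost is a small local solve whose size is the number of lost entries, rather than a rollback to an earlier state. Knowing exactly which entries are lost is what makes the local system well defined, which is the role the abstract assigns to page-level detection.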
The advent of extreme scale machines will require the use of parallel resource...
Abstract. Resilience is a major challenge for large-scale systems. It is particularly important for ...
In this talk we will discuss possible numerical remedies to survive data loss ...
This report presents a method to recover from faults detected by hardware in numerical iterative sol...
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for a...
As we stride toward the exascale era, due to increasing complexity of supercomputers, hard and soft ...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
The advent of extreme scale machines will require the use of parallel resour...
Several recent papers have introduced a periodic verification mechanism to det...
As the computational power of high performance computing (HPC) systems continu...
Several recovery techniques for parallel iterative methods are presented. First, the implementation ...