We present a fault model, based on mathematical properties, that is designed to bring out the “worst” in iterative solvers. Our model introduces substantially higher overhead, but smaller variance, than a fault model based on random bit flips. We also relate the statistics from our experiments back to the solvers' configuration, and briefly address the computational effort that each model requires. Our approach requires significantly fewer resources, while punishing our solvers with undetectable errors that require notable overhead for recovery. This work also illustrates the robustness of our resilient algorithms: not only do we make forward progress in the presence of pathological faults, we always obtain the correct answer.
With the advent of exascale computing and the realization that memory errors will be an ever importa...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
As hardware devices like processor cores and memory sub-systems based on nano-scale technol...
ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors. (Under the direc...
The advent of extreme scale machines will require the use of parallel resource...
Energy increasingly constrains modern computer hardware, yet protecting computations and data agains...
Soft errors are increasing in modern computer systems. These faults can corrupt the results of nume...
Soft errors caused by transient bit flips have the potential to significantly impact an application's...
Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '15)...
In the multi-peta-flop era for supercomputers, the number of computing cores is growing expo...
In this talk we will discuss possible numerical remedies to survive data loss...
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE)...
As we stride toward the exascale era, due to increasing complexity of supercomputers, hard and soft ...
Resilience is considered a challenging under-addressed issue that the high performance computing com...
Resilient algorithms in high-performance computing are subject to rigorous non-functional constrain...