Abstract. Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos' solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/...
Abstract—Increasing parallelism and transistor density, along with increasingly tighter energy and p...
As we stride toward the exascale era, due to increasing complexity of supercomputers, hard and soft ...
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE)...
Abstract. In the multi-peta-flop era for supercomputers, the number of computing cores is growing expo...
Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '15)...
Resilience is considered a challenging under-addressed issue that the high performance computing com...
International audience: The advent of extreme scale machines will require the use of parallel resour...
We present a fault model designed to bring out the “worst” in iterative solvers based on mathematica...
International audience: As the computational power of high performance computing (HPC) systems continu...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors. (Under the direc...
Energy increasingly constrains modern computer hardware, yet protecting computations and data agains...
Abstract. Exascale studies project reliability challenges for future high-performance computing (HPC) ...
International audience: The advent of extreme scale machines will require the use of parallel resource...