Iterative methods are commonly used to solve large, sparse linear systems, a fundamental operation in many modern scientific simulations. When large-scale iterative methods run in parallel across a large number of ranks, they are expected to be more susceptible both to soft errors in logic circuits and memory subsystems and to fail-stop errors in the system as a whole, given the large component counts and lower power margins of emerging high-performance computing (HPC) platforms. To protect iterative methods from soft errors, we propose an online algorithm-based fault tolerance (ABFT) approach that efficiently detects and recovers from soft errors. We design a novel checksum-based encoding scheme for...
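To make the checksum idea concrete, the sketch below shows the classic ABFT invariant for a matrix-vector product, the kernel at the core of most iterative solvers. It is an illustration under stated assumptions, not the encoding scheme proposed in this paper (whose details are not given in the text above). The function names `matvec`, `column_checksum`, and `checksum_ok`, the tolerance `tol`, and the small dense test matrix are all hypothetical choices made for the example.

```python
# Hypothetical sketch of checksum-based soft error detection for y = A x.
# This is the textbook ABFT invariant, not the paper's own encoding scheme.

def matvec(A, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def column_checksum(A):
    """c = 1^T A: the sum of each column of A, computed once up front."""
    return [sum(col) for col in zip(*A)]

def checksum_ok(c, x, y, tol=1e-9):
    """Check the invariant sum(y) == c . x, which holds when y = A x is error-free."""
    expected = sum(c_j * x_j for c_j, x_j in zip(c, x))
    actual = sum(y)
    # Relative tolerance absorbs ordinary floating-point round-off.
    return abs(actual - expected) <= tol * max(1.0, abs(expected))

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 2.0, 3.0]

c = column_checksum(A)
y = matvec(A, x)
clean = checksum_ok(c, x, y)          # invariant holds for the clean result

y_bad = list(y)
y_bad[1] += 0.5                       # simulate a soft error in one entry of y
corrupted = checksum_ok(c, x, y_bad)  # invariant is violated
```

The check costs one dot product and one sum per matrix-vector product, which is small next to the product itself. On its own it only detects an error. Recovery needs additional information, such as a second, weighted checksum or a saved earlier state.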
Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft e...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
As large-scale linear equation systems are pervasive in many scientific fields, great efforts have b...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
Emerging high-performance computing platforms, with large component counts and lower power margins, ...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
Dense matrix factorizations, like LU, Cholesky and QR, are widely used for scientific applications t...
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE)...
As the desire of scientists to perform ever larger computations drives the size of today’s high perf...
In high-performance systems, the probability of failure is higher for larger systems. The probabili...
The probability that a failure will occur before the end of the computation increases as the number ...
The mean time between failure (MTBF) of large supercomputers is decreasing, and future exascale comp...
The lack of efficient resilience solutions is expected to be a major problem for the coming exascale...
Energy increasingly constrains modern computer hardware, yet protecting computations and data agains...