Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a c...
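To make the idea of a checksum-protected matrix-vector product concrete, here is a minimal sketch of the classic column-checksum test: if c = 1^T A is stored ahead of time, then sum(A x) should equal c . x. This is the textbook ABFT check, not the novel encoding scheme the abstract refers to, and the function names, the tolerance `tol`, and the example data are all assumptions chosen for illustration.

```python
def column_checksum(A):
    """Return the checksum row c = 1^T A (the sum of each column of A)."""
    return [sum(col) for col in zip(*A)]


def matvec(A, x):
    """Plain dense matrix-vector product y = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def checked_matvec(A, x, checksum, tol=1e-9):
    """Compute y = A x and compare sum(y) against checksum . x.

    Returns (y, ok). ok is False when the two values differ by more
    than a relative tolerance, which signals a possible soft error.
    The tolerance is an illustrative assumption, needed because
    floating-point rounding makes exact equality unreliable.
    """
    y = matvec(A, x)
    expected = sum(c * xi for c, xi in zip(checksum, x))
    actual = sum(y)
    scale = max(1.0, abs(expected), abs(actual))
    return y, abs(expected - actual) <= tol * scale


A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 2.0, 3.0]
c = column_checksum(A)  # encoded once, before any iterations run

y, ok = checked_matvec(A, x, c)

# Simulate a memory soft error: corrupt one entry of A after encoding.
A_bad = [row[:] for row in A]
A_bad[1][2] += 0.5
y_bad, ok_bad = checked_matvec(A_bad, x, c)
```

In this sketch the clean product passes the check and the corrupted one fails it. A single checksum of this kind only detects an error; locating or correcting it would need additional checksums or a rollback mechanism.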
In algorithm-based fault tolerance (ABFT), fault tolerance is tailored to the algorithm performed. M...
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for a...
The lack of efficient resilience solutions is expected to be a major problem for the coming exascale...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE)...
The mean time between failures (MTBF) of large supercomputers is decreasing, and future exascale comp...
A new ABFT architecture is proposed to tolerate multiple soft errors with low overheads. It memorize...
Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft e...
© Springer International Publishing Switzerland 2016. As the threat of fault susceptibility caused by...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications...
Several recent papers have introduced a periodic verification mechanism to det...
In this paper, we address the issue of soft errors in random logic and develop solutions that provid...