Dense matrix factorizations, such as LU, Cholesky and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We will present a generic solution for protecting the right factor, where the updates are...
The probability that a failure will occur before the end of the computation increases as the number ...
As the desire of scientists to perform ever larger computations drives the size of today’s high perf...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
Dense matrix factorizations like LU, Cholesky and QR are widely used for scientific applications tha...
The lack of efficient resilience solutions is expected to be a major problem for the coming exascale...
This paper presents an algorithm-based fault tolerance method to harden three two-sided matrix facto...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications...
Modeling and analysis of large-scale scientific systems often use linear least squares regr...
The mean time between failures (MTBF) of large supercomputers is decreasing, and future exascale comp...
In high-performance systems, the probability of failure is higher for larger systems. The probabili...
As large-scale linear equation systems are pervasive in many scientific fields, great efforts have b...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...