Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable node and the possibility of losing both data and checksum from a single failure. We will present a generic solution for protecting the right factor, where the updates are appl...
As the desire of scientists to perform ever larger computations drives the size of today’s high perf...
The probability that a failure will occur before the end of the computation increases as the number ...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
Dense matrix factorizations, like LU, Cholesky and QR, are widely used for scientific applications t...
Dense matrix factorizations like LU, Cholesky and QR are widely used for scientific applications tha...
The lack of efficient resilience solutions is expected to be a major problem for the coming exascale...
This paper presents an algorithm-based fault tolerance method to harden three two-sided matrix facto...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications...
Modeling and analysis of large scale scientific systems often use linear least squares regr...
The mean time between failures (MTBF) of large supercomputers is decreasing, and future exascale comp...
In high-performance systems, the probability of failure is higher for larger systems. The probabili...
As large-scale linear equation systems are pervasive in many scientific fields, great efforts have b...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...