Dense matrix factorizations like LU, Cholesky and QR are widely used for scientific applications that require solving systems of linear equations, eigenvalue problems and linear least squares problems. Such computations are normally carried out on supercomputers, where the ever-growing scale induces a fast decrease of the Mean Time To Failure (MTTF). This paper proposes a new algorithm-based fault tolerant (ABFT) approach, designed to survive fail-stop failures during dense matrix factorizations in extreme conditions such as the absence of any reliable components, and the possibility of losing both data and checksum from a single failure. Both left and right factorization results are protected by ABFT algorithms, and fault-tolerant algorithms ...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
In the multi-peta-flop era for supercomputers, the number of computing cores is growing expo...
An important consideration in the design of high performance multiprocessor systems is to ensure the...
Dense matrix factorizations, like LU, Cholesky and QR, are widely used for scientific applications t...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
The lack of efficient resilience solutions is expected to be a major problem for the coming exascale...
This paper presents an algorithm based fault tolerance method to harden three two-sided matrix facto...
The mean time between failure (MTBF) of large supercomputers is decreasing, and future exascale comp...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications...
Modeling and analysis of large scale scientific systems often use linear least squares regr...
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...
In high-performance systems, the probability of failure is higher for larger systems. The probabili...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
Algorithm based fault tolerance (ABFT) attracts renewed interest for its extremely low overhead and ...
As large-scale linear equation systems are pervasive in many scientific fields, great efforts have b...