In this paper, we extend the theory of algorithmic fault-tolerant matrix-matrix multiplication, C = AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we theoretically show that the methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead rollback approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that the methods work well in practice with an acceptable level of overhead relative to high-performance implementations without fault-toleranc...
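To make the detect-then-roll-back idea concrete, here is a minimal sketch of the classic checksum test for C = AB, which compares C w against A (B w) for an all-ones vector w. It is an illustration under stated assumptions, not the paper's actual method: the function names (`matmul`, `matvec`, `suspect_rows`, `checked_matmul`), the absolute tolerance `tol`, and the retry limit are all hypothetical, and this simple test only flags corruption in C, whereas the abstract also covers errors in A and/or B.

```python
def matmul(A, B):
    """Plain product of two matrices stored as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(inner)) for j in range(cols)]
            for row in A]


def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def suspect_rows(A, B, C, tol=1e-9):
    """Return indices i where (C w)_i differs from (A (B w))_i, with w all ones.

    The check costs three matrix-vector products, which is cheap next to the
    matrix-matrix product itself. `tol` is an assumed absolute tolerance; a
    production check would scale it with the magnitudes of A and B.
    """
    w = [1.0] * len(B[0])
    expected = matvec(A, matvec(B, w))
    actual = matvec(C, w)
    return [i for i, (e, a) in enumerate(zip(expected, actual))
            if abs(e - a) > tol]


def checked_matmul(A, B, multiply=matmul, max_attempts=3):
    """Recompute C = AB until the checksum test passes (a naive rollback)."""
    for _ in range(max_attempts):
        C = multiply(A, B)
        if not suspect_rows(A, B, C):
            return C
    raise RuntimeError("checksum test failed on every attempt")


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    C = matmul(A, B)
    print(suspect_rows(A, B, C))   # no corrupted rows reported
    C[1][0] += 1.0                 # inject a single-entry error into C
    print(suspect_rows(A, B, C))   # row 1 is flagged
```

Recomputing the whole product on failure is the simplest possible rollback; a lower-overhead variant would apply the same test to blocks of C and recompute only the block that fails.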
The lack of efficient resilience solutions is expected to be a major problem for the coming exascale...
Dense matrix factorizations like LU, Cholesky and QR are widely used for scientific applications tha...
Existing fault-tolerant matrix-inversion schemes suffer several drawbacks, such as being...
This paper compares several fault-tolerance methods for the detection and corr...
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
This paper presents an algorithm-based fault tolerance method to harden three two-sided matrix facto...
As the desire of scientists to perform ever larger computations drives the size of today’s high perf...
A fault-tolerant array for matrix multiplication that explicitly incorporates mechanisms for easy te...
An approach to design fault-tolerant hexagonal systolic array (SA) for multiplication of rec...
Dense matrix factorizations, like LU, Cholesky and QR, are widely used for scientific applications t...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
An important consideration in the design of high performance multiprocessor systems is to ensure the...
In this study, we propose a simple method for fault-tolerant Strassen-like matrix multiplications. T...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...