Current algorithm-based fault tolerance (ABFT) approaches for one-sided matrix decomposition on heterogeneous systems with GPUs have the following limitations: (1) they do not provide sufficient protection, as most of them maintain checksums in only one dimension; (2) their checking schemes are inefficient because of redundant checksum verifications; (3) they fail to protect PCIe communication; and (4) their checksum calculation, which is based on a special type of matrix multiplication, is far from efficient. By overcoming these limitations, we design an efficient ABFT approach that provides stronger protection for one-sided matrix decomposition methods on heterogeneous systems. First, we provide full matrix protection by using checksums in two dimensions. Second,...
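To illustrate why checksums in two dimensions give stronger protection than checksums in one, the following is a minimal sketch in plain Python. It is not the implementation described in this work: it ignores GPUs, PCIe transfers, and the decomposition itself, and every function name in it is hypothetical. It only assumes the textbook idea that a row checksum and a column checksum together pinpoint a single corrupted element, so the element can be repaired in place.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def row_checksums(a: Matrix) -> List[float]:
    """Return the sum of each row (one checksum per row)."""
    return [sum(row) for row in a]


def column_checksums(a: Matrix) -> List[float]:
    """Return the sum of each column (one checksum per column)."""
    return [sum(col) for col in zip(*a)]


def locate_single_error(
    a: Matrix, row_cs: List[float], col_cs: List[float], tol: float = 1e-9
) -> Optional[Tuple[int, int, float]]:
    """Compare fresh checksums with the stored ones.

    Returns (row, column, delta) for a single corrupted element, where
    delta is the amount by which the element is too large, or None when
    both dimensions agree with the stored checksums.
    """
    bad_rows = [
        (i, now - ref)
        for i, (now, ref) in enumerate(zip(row_checksums(a), row_cs))
        if abs(now - ref) > tol
    ]
    bad_cols = [
        j
        for j, (now, ref) in enumerate(zip(column_checksums(a), col_cs))
        if abs(now - ref) > tol
    ]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        # This sketch only handles one corrupted element at a time.
        raise ValueError("mismatch pattern is not a single-element error")
    row, delta = bad_rows[0]
    return row, bad_cols[0], delta


def correct_single_error(
    a: Matrix, row_cs: List[float], col_cs: List[float], tol: float = 1e-9
) -> bool:
    """Repair a single corrupted element in place; return True if one was fixed."""
    hit = locate_single_error(a, row_cs, col_cs, tol)
    if hit is None:
        return False
    row, col, delta = hit
    a[row][col] -= delta
    return True
```

With only one of the two checksum vectors, a mismatch identifies a whole row or a whole column but not the element inside it. The intersection of the mismatching row and the mismatching column is what makes in-place correction possible in this sketch.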
An important consideration in the design of high performance multiprocessor systems is to ensure the...
As large-scale linear equation systems are pervasive in many scientific fields, great efforts have b...
A fault-tolerant array for matrix multiplication that explicitly incorporates mechanisms for easy te...
Heterogeneous computing systems with both CPUs and GPUs have become a class of widely used hardware ar...
This paper presents an algorithm based fault tolerance method to harden three two-sided matrix facto...
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...
Deep learning technology has enabled the development of increasingly complex safety-related autonomo...
Extensive research has been done on developing and optimizing algorithm-based fault tolerance (AB...
Dense matrix factorizations, like LU, Cholesky and QR, are widely used for scientific applications t...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
Dense matrix factorizations like LU, Cholesky and QR are widely used for scientific applications tha...
The lack of efficient resilience solutions is expected to be a major problem for the coming exascale...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
The mean time between failure (MTBF) of large supercomputers is decreasing, and future exascale comp...
In this paper, we extend the theory of algorithmic fault-tolerant matrix-matrix multiplication, C =...