The lack of efficient resilience solutions is expected to be a major problem for the coming exascale supercomputers, as the chance that a long-running, large-scale computation can finish without faults is diminishing quickly. In this dissertation I aim to develop algorithmic techniques that provide fault tolerance for commonly used matrix factorization algorithms and their high-performance implementations on distributed-memory, massively parallel systems, with very low overhead and high scalability. Specifically, I design numerical error-correcting encodings of matrices and the corresponding algorithms to tolerate hardware faults during matrix factorizations. This approach has much in common with the error correcting codes (ECC) used widely in communication and storag...
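As a rough illustration of what a numerical error-correcting encoding of a matrix can look like, the sketch below uses the classic row/column checksum scheme for matrix multiplication. This is a swapped-in textbook technique chosen for brevity, not the dissertation's factorization algorithms, and all function names are hypothetical. It uses small integer matrices so that checksum comparisons are exact.

```python
# Minimal sketch of checksum-based error correction (assumed scheme,
# hypothetical helper names; not the encoding used in the dissertation).

def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append one row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append one column holding the sum of each row."""
    return [row + [sum(row)] for row in b]

def locate_and_correct(c):
    """Repair at most one corrupted data entry of a full-checksum matrix.

    Returns the (row, col) that was repaired, or None if all checksums agree.
    """
    n, m = len(c) - 1, len(c[0]) - 1  # size of the data part
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted data entry")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the entry from the row checksum and the rest of the row.
    c[i][j] = c[i][m] - sum(c[i][k] for k in range(m) if k != j)
    return (i, j)

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
# Multiplying the encoded operands yields a product that carries
# both a checksum row and a checksum column.
c = matmul(add_column_checksum(a), add_row_checksum(b))
expected = [row[:] for row in c]
c[0][1] += 9  # simulate a silent fault in one data entry
repaired = locate_and_correct(c)
```

The failing row checksum and failing column checksum intersect at the corrupted entry, which is then rebuilt from the row checksum. With floating-point data the equality tests would need a tolerance chosen relative to rounding error, which this sketch does not address.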
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
Abstract- The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...
This dissertation details contributions made by the author to the field of computer science while wo...
This paper presents an algorithm based fault tolerance method to harden three two-sided matrix facto...
Dense matrix factorizations like LU, Cholesky and QR are widely used for scientific applications tha...
Dense matrix factorizations, like LU, Cholesky and QR, are widely used for scientific applications t...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
The mean time between failures (MTBF) of large supercomputers is decreasing, and future exascale comp...
As an increasing number of modern big data systems utilize horizontal scaling, the general trend in t...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
Abstract—Modeling and analysis of large scale scientific systems often use linear least squares regr...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications...
As large-scale linear equation systems are pervasive in many scientific fields, great efforts...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
As the desire of scientists to perform ever larger computations drives the size of today’s high perf...