Silent Data Corruption Resilient Matrix Factorizations on Distributed Memory System

Wu, Panruo

Publication date

January 2016

Publisher

eScholarship, University of California

Abstract

The lack of efficient resilience solutions is expected to be a major problem for the coming exascale supercomputers, as the chance that a long-running, large-scale computation can finish without faults is diminishing quickly. In this dissertation I try to develop algorithmic techniques that provide fault tolerance for commonly used matrix factorization algorithms and their high-performance implementations on distributed-memory massively parallel systems, with very low overhead and high scalability. Specifically, I design numerical error-correcting encodings of matrices and the corresponding algorithms to tolerate hardware faults during matrix factorizations. It is in common with error correcting codes (ECC) used widely in communication and storag...
