Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources

Zizhong Chen

Publication date

February 2008

Abstract

As the desire of scientists to perform ever larger computations drives the size of today’s high-performance computers from hundreds, to thousands, and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces considerable overhead, especially when applications modify a large amount of memory between checkpoints. This paper presents an algorithm-based checkpoint-free fault tolerance approach in which, instead of taking checkpoints periodically, a coded, globally consistent state of the critical application data is maintained in memory by modifying applications to operate on encoded data. Altho...
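The idea of operating on encoded data can be illustrated with a small sketch. The abstract does not say which encoding is used, so the code below assumes a simple checksum encoding: a checksum row is appended to one matrix and a checksum column to the other, and the product then carries its own checksums. The function names (`encode_rows`, `encode_cols`, `matmul`, `recover_row`) are hypothetical, the data rows stand in for per-node blocks, and everything runs in a single process with integer entries to avoid round-off.

```python
def encode_rows(a):
    """Append a checksum row: each entry is the sum of its column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def encode_cols(b):
    """Append a checksum column: each entry is the sum of its row."""
    return [row[:] + [sum(row)] for row in b]


def matmul(a, b):
    """Plain dense matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def recover_row(c_full, lost):
    """Rebuild one lost data row from the checksum row and the survivors.

    Assumption for this sketch: the last row of c_full is the checksum row
    and exactly one data row (index `lost`) is unavailable.
    """
    checksum = c_full[-1]
    survivors = [row for i, row in enumerate(c_full[:-1]) if i != lost]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The product of the encoded operands is itself checksum-encoded,
# so no separate checkpoint of the result is needed in this sketch.
c_full = matmul(encode_rows(a), encode_cols(b))

# Pretend the "node" holding data row 1 failed, then rebuild its row.
rebuilt = recover_row(c_full, lost=1)
print(rebuilt == c_full[1])  # True
```

In this sketch the checksum relationship holds for the final product, so a single lost row can be rebuilt by subtraction alone. With floating-point data the rebuilt row would match only up to round-off.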
