Finding the failure rate of a system is a crucial step in high-performance computing systems analysis. To deal with this problem, a fault-tolerant mechanism called the checkpoint/restart technique was introduced. However, this mechanism incurs additional costs. We therefore propose two models for different schemes (full and incremental checkpointing). The models, which are based on the reliability of the system, are used to determine checkpoint placements. Both proposed models balance the checkpoint overhead against the re-computation time. Because each incremental checkpoint adds extra cost during the recovery period, a method to find the number of incremental checkpoints between two consecutive full checkpo...
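The abstract's own reliability-based placement models are not reproduced here. As a stand-in illustration of the checkpoint-overhead versus re-computation trade-off it describes, the following sketch uses the classic first-order Young approximation, which picks the checkpoint interval minimizing expected lost work for a given checkpoint cost and system MTBF (both values below are assumed, not taken from the paper):

```python
import math

def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order Young approximation of the optimal checkpoint interval.

    checkpoint_cost: time to write one full checkpoint (seconds)
    mtbf: system mean time between failures (seconds)

    A short interval wastes time writing checkpoints; a long interval
    wastes time re-computing after a failure. The square-root formula
    balances the two to first order.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Hypothetical numbers: 60 s per checkpoint, 24 h system MTBF
tau = young_interval(60.0, 24 * 3600.0)  # about 3220 s between checkpoints
```

Incremental schemes change only the effective `checkpoint_cost` (and add per-increment recovery cost), which is why the paper needs a separate model to choose how many incremental checkpoints to place between two full ones.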
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
Next-generation exascale systems, those capable of performing a quintillion operations per second, ...
Parallel computing systems provide hardware redundancy that helps to achieve low-cost fault-toleran...
Computational power demand for large challenging problems has increasingly driven the physical size ...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
The HPC community projects that future extreme-scale systems will be much less stable than curr...
Large-scale applications running on new computing platforms with thousands o...
Over the last decade, computing systems have turned to large-scale parallel platforms composed of thousand...
Long-running applications are often subject to failures. Once failures occur, they lead to unacce...
The traditional single-level checkpointing method suffers from significant ov...
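The multilevel idea alluded to above is to interleave cheap node-local checkpoints (which survive only soft failures) with expensive global ones written to the parallel file system. A minimal sketch of such an interleaved schedule, with the every-k-th-global policy as an assumption for illustration rather than any specific paper's algorithm:

```python
def two_level_schedule(num_checkpoints: int, global_every: int) -> list[str]:
    """Label each checkpoint slot as 'local' (fast, node-level storage)
    or 'global' (slow, parallel file system), promoting every
    `global_every`-th checkpoint to the global level."""
    return ["global" if (i + 1) % global_every == 0 else "local"
            for i in range(num_checkpoints)]

sched = two_level_schedule(6, 3)
# ['local', 'local', 'global', 'local', 'local', 'global']
```

Recovery then restores from the most recent checkpoint whose level matches the failure severity, so the single-level overhead is paid only once every `global_every` checkpoints.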
In this paper, we present a unified model for several well-known checkpoint/re...
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enable...