Parallel scientific applications cope with machine unreliability through periodic checkpointing, in which all processes coordinate to dump memory to stable storage simultaneously. However, in systems comprising tens of thousands of nodes, the total data volume can overwhelm the network and storage farm, creating an I/O bottleneck. Furthermore, a very large class of scientific applications fails entirely on these systems if even one of its processes dies. Poor checkpointing performance limits checkpointing frequency and increases applications' time-to-solution. And because large systems tend to fail often, applications spend more time in recovery and restart. Diskless checkpointing is a viable approach that provides high-performance and ...
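To make the diskless idea concrete: rather than writing to a parallel file system, each process keeps its checkpoint in memory and a peer group maintains an XOR parity block, so any single lost checkpoint can be rebuilt without touching disk. The sketch below is illustrative only, assuming the classic single-failure XOR scheme; NumPy arrays stand in for per-rank memory, and encode_parity/recover are hypothetical names.

```python
import numpy as np

def encode_parity(checkpoints):
    # XOR-fold the in-memory checkpoints of a peer group into one
    # parity block, typically held by a dedicated checkpoint process.
    parity = np.zeros_like(checkpoints[0])
    for ckpt in checkpoints:
        parity ^= ckpt
    return parity

def recover(surviving, parity):
    # A single lost checkpoint is the XOR of the parity block with
    # every surviving checkpoint.
    lost = parity.copy()
    for ckpt in surviving:
        lost ^= ckpt
    return lost

# Hypothetical 4-process group; each rank holds a 1 KiB local checkpoint.
rng = np.random.default_rng(0)
ranks = [rng.integers(0, 256, 1024, dtype=np.uint8) for _ in range(4)]
parity = encode_parity(ranks)

# Rank 2 fails: its checkpoint is rebuilt with no disk access at all.
restored = recover(ranks[:2] + ranks[3:], parity)
assert np.array_equal(restored, ranks[2])
```

The trade-off is that parity encoding consumes memory and network bandwidth on the compute nodes themselves, which is exactly how the approach sidesteps the storage-farm bottleneck described above.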
As the number of CPU cores in high-performance computing platforms continues to grow, the availabili...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
With increasing scale and complexity of supercomputing and cloud computing arc...
Ph.D. thesis, University of Illinois at Urbana-Champaign, 2005. As a technology projection, w...
As the number of processors in today’s parallel systems continues to grow, the mean-time-to-failure ...
By leveraging enormous computational capability, scientists today are able to ...
As we move to large manycores, the hardware-based global checkpointing schemes that have been propo...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
A checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
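A minimal illustration of that definition, as a sketch under stated assumptions (a serial program; the file name state.ckpt and the 100-step interval are arbitrary choices): the program interrupts normal processing at designated places, preserves its status to stable storage, and on restart resumes from the saved state instead of from the beginning.

```python
import os
import pickle

CKPT = "state.ckpt"  # illustrative checkpoint file

def save_checkpoint(state):
    # Write atomically: dump to a temp file, then rename, so a crash
    # mid-write never leaves a corrupt checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def load_checkpoint():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}

state = load_checkpoint()
for step in range(state["step"], 1000):
    state["total"] += step            # the "normal processing"
    state["step"] = step + 1
    if step % 100 == 0:               # a designated checkpoint place
        save_checkpoint(state)
```

Killing and rerunning the script resumes from the most recently checkpointed step rather than from step 0.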
This work provides an analysis of checkpointing strategies for minimizing expe...
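The truncated abstract does not show which strategy the analysis arrives at; the classic first-order answer to this minimization is Young's approximation, tau_opt = sqrt(2CM) for checkpoint cost C and mean time between failures M, later refined by Daly. A small sketch of that standard formula (not necessarily this work's result):

```python
from math import sqrt

def young_interval(ckpt_cost_s, mtbf_s):
    # Young's first-order optimal checkpoint interval:
    # tau_opt = sqrt(2 * C * M), with C the time to write one
    # checkpoint and M the platform mean time between failures.
    return sqrt(2.0 * ckpt_cost_s * mtbf_s)

# e.g. a 5-minute checkpoint on a platform with a 24-hour MTBF
print(young_interval(300, 24 * 3600) / 3600, "hours")  # 2.0 hours
```

Checkpointing more often than tau_opt wastes time writing checkpoints; checkpointing less often wastes time recomputing lost work after a failure.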
Fast checkpointing algorithms require distributed access to stable storage. Th...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
Checkpointing and rollback recovery is a very effective technique to tolerate the occurrence of fail...