LANL Technical Release LA-UR 09-02117 PLFS: A Checkpoint Filesystem for Parallel Applications

Publication date

January 2015

Abstract

Parallel applications running across thousands of processors must protect themselves from inevitable component failures. Many applications insulate themselves from failures by checkpointing, a process in which they save their state to persistent storage. Following a failure, they can resume computation using this state. For many applications, saving this state into a shared single file is most convenient. With such an approach, writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files. To address this fundamental mis...


