Run-through stabilization: An MPI proposal for process fault tolerance

Joshua Hursey
Richard L. Graham
Greg Bronevetsky
Darius Buntinas
Howard Pritchard
David G. Solt

Publication date

January 2011

DOI

10.1007/978-3-642-24449-0_40

ISSN

0302-9743

Abstract

The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault resilient MPI applications. The mission of the MPI Forum’s Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal. This proposal allows an application to continue execution even if MPI processes fail during execution. The discussion introduces the implications on point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.
