Abstract. The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault-resilient MPI applications. The mission of the MPI Forum’s Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault-tolerant HPC applications. This paper presents an overview of the Run-Through Stabilization proposal, which allows an application to continue execution even if MPI processes fail during execution. The discussion introduces the implications for point-to-point and collective operations over communicators, though the full proposal addresses all aspects of the MPI standard.
Abstract. As recent research has demonstrated, it is becoming a necessity for large scale applicatio...
Because of increasing hardware and software complexity, the running time of many computational scie...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
Abstract. Exascale targeted scientific applications must be prepared for a highly concurrent computin...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
With increasing numbers of processors on current machines, the probability for node or link failure...
Application developers are investigating Algorithm Based Fault Tolerance (ABFT) techniques to improv...
With the increasing number of processors in modern HPC (High Performance Computing) systems (65536 in...
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process f...
High performance computing platforms such as Clusters, Grid and Desktop Grids ...
Due to the increasing size of HPC machines, dealing with faults is becoming mandatory due to their h...
Thesis (Ph.D.) - Indiana University, Computer Sciences, 2010. Scientists use advanced computing techni...
With increasing numbers of processors on current machines, the probability for node or link failure...
Abstract. This paper examines the topic of writing fault-tolerant MPI applications. We discuss the m...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...