In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process failures in a way beyond that of the original MPI static process model. FT-MPI allows the semantics and associated modes of failures to be explicitly controlled by an application via modified functionality within the standard MPI 1.2 API. We give an overview of the FT-MPI semantics, architecture design, example usage and sample applications. We also briefly discuss the consequences of designing a fault-tolerant MPI, both in terms of how such an implementation handles failures at multiple levels internally and how existing applications can use new features while still remaining within the MPI standard.
In this paper we describe the design of fault tolerance capabilities for general-purpose offload sem...
Abstract. As the scale of computing platforms becomes increasingly extreme, the requirements for app...
Abstract. Fault Tolerant MPI (FT-MPI) [6] was designed as a solution to allow applications different ...
Abstract. This paper examines the topic of writing fault-tolerant MPI applications. We discuss the m...
With increasing numbers of processors on current machines, the probability for node or link failure...
With increasing numbers of processors on today's machines, the probability for node or link failures ...
Abstract. The MPI standard lacks semantics and interfaces for sustained application execution in th...
One of the topics of paramount importance in the development of Cluster and Grid middleware is the i...
High performance computing platforms such as Clusters, Grid and Desktop Grids ...
With increasing numbers of processors on today's machines, the probability for node or link failures...
Abstract. Exascale targeted scientific applications must be prepared for a highly concurrent computin...
Abstract. As recent research has demonstrated, it is becoming a necessity for large scale applicatio...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
With increasing numbers of processors on current machines, the probability for node or link failure...