In this paper we describe the design of fault tolerance capabilities for general-purpose offload semantics, based on the OmpSs programming model. Using ParaStation MPI, a production MPI-3.1 implementation, we explore the features that a standard-compliant MPI stack must support to provide the necessary fault tolerance guarantees on top of MPI's dynamic process management. Our results, covering synthetic benchmarks and applications, show low runtime overhead and efficient recovery, demonstrating that the existing MPI standard provided us with sufficient mechanisms to implement an effective and efficient fault-tolerant solution. This research received funding from the European Community’s 7th Framework Programme via the DEEP-ER pro...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
With increasing numbers of processors on today's machines, the probability for node or link failures...
Abstract. The MPI standard lacks semantics and interfaces for sustained application execution in th...
Abstract. Fault Tolerant MPI (FT-MPI)[6] was designed as a solution to allow applications different ...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
Abstract. This paper examines the topic of writing fault-tolerant MPI applications. We discuss the m...
Long-running MPI applications on clusters and grids that are prone to node and network failures, mot...
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process f...
With increasing numbers of processors on current machines, the probability for node or link failure...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
Abstract—Exascale targeted scientific applications must be prepared for a highly concurrent computin...
Abstract—As recent research has demonstrated, it is becoming a necessity for large scale applicatio...
High performance computing platforms such as Clusters, Grid and Desktop Grids ...
With increasing numbers of processors on today's machines, the probability for node or link failures ...