Abstract. This paper examines how to write fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
With the increasing number of processors in modern HPC (High Performance Computing) systems (65536 in...
With increasing numbers of processors on today's machines, the probability for node or link failures ...
As the size of high performance clusters multiplies, the probability of system failure grows substa...
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process f...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
High performance computing platforms such as Clusters, Grid and Desktop Grids ...
Abstract. As recent research has demonstrated, it is becoming a necessity for large scale applicatio...
Because of increasing hardware and software complexity, the running time of many computational scien...
One of the topics of paramount importance in the development of Cluster and Grid middleware is the i...
With increasing numbers of processors on current machines, the probability for node or link failure...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
Abstract. As the scale of computing platforms becomes increasingly extreme, the requirements for app...
Abstract. The MPI standard lacks semantics and interfaces for sustained application execution in th...