With the increasing number of processors on today's machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming a more important issue for both end users and the institutions running the machines. This paper presents the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. The architecture of FT-MPI, an implementation of MPI using these semantics, is presented together with benchmark results for various applications. An example of a fault-tolerant ...
Abstract. This paper examines the topic of writing fault-tolerant MPI applications. We discuss the m...
Today’s high performance computing systems are made possible by multiple increases in hardware paral...
One of the topics of paramount importance in the development of Cluster and Grid middleware is the i...
With increasing numbers of processors on today's machines, the probability for node or link failures...
With increasing numbers of processors on today's machines, the probability for node or link failures ...
With increasing numbers of processors on current machines, the probability for node or link failure...
With increasing numbers of processors on current machines, the probability for node or link failure...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
As supercomputers are entering an era of massive parallelism where the frequency of faults is increa...
High performance computing platforms such as Clusters, Grid and Desktop Grids ...
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process f...
Abstract. Fault Tolerant MPI (FT-MPI)[6] was designed as a solution to allow applications different ...
Scientific applications have long embraced the MPI as the environment of choice to execute on large ...