With the increasing number of processors on today's machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming a more important issue for both end users and the institutions running the machines. This paper presents the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. The architecture of FT-MPI, an implementation of MPI using these semantics, as well as benchmark results with various applications, are presented. An example of a fault-tolerant paralle...
Today’s high performance computing systems are made possible by multiple increases in hardware paral...
One of the topics of paramount importance in the development of Cluster and Grid middleware is the i...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
With increasing numbers of processors on today's machines, the probability for node or link failures...
With increasing numbers of processors on current machines, the probability for node or link failure...
With increasing numbers of processors on current machines, the probability for node or link failure...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC...
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process f...
High performance computing platforms such as Clusters, Grid and Desktop Grids ...
Abstract. Fault Tolerant MPI (FT-MPI)[6] was designed as a solution to allow applications different ...
As supercomputers are entering an era of massive parallelism where the frequency of faults is increa...
Scientific applications have long embraced the MPI as the environment of choice to execute on large ...
Abstract. This paper examines the topic of writing fault-tolerant MPI applications. We discuss the m...