One of the topics of paramount importance in the development of Cluster and Grid middleware is the impact of faults, since their occurrence probability in a Grid infrastructure and in large-scale distributed systems is very high. MPI (Message Passing Interface) is a popular abstraction for programming distributed computation applications. FAIL is an abstract language for fault occurrence description capable of expressing complex and realistic fault scenarios. In this paper, we investigate the possibility of using FAIL to inject faults in a fault-tolerant MPI implementation. Our middleware, FAIL-MPI, is used to carry out quantitative and qualitative fault and stress testing.
With increasing numbers of processors on current machines, the probability for node or link failure...
This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputin...
One of the topics of paramount importance in the development of Cluster and Grid middleware is the i...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
High performance computing platforms such as Clusters, Grid and Desktop Grids ...
Abstract. This paper examines the topic of writing fault-tolerant MPI applications. We discuss the m...
One of the topics of paramount importance in the development of Grid middleware is the impact of fau...
In a network consisting of several thousands computers, the occurrence of faul...
We present in this paper a study on fault management in a grid middleware. The...
Scientific applications have long embraced the MPI as the environment of choice to execute on large ...
In a network consisting of several thousands computers, the occurrence of faults is unavoidable. B...
Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC...
Selected for publication in the post-conference book. Computing grids are large-scale, highly-distribu...
With increasing numbers of processors on today's machines, the probability for node or link failures ...