We present in this paper an evaluation of fault management in the grid middleware P2P-MPI. One of P2P-MPI's objectives is to support environments built from commodity hardware. Running programs in such environments is therefore failure-prone, and particular attention must be paid to fault management. Fault management covers two issues: fault tolerance and fault detection. P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. In this paper, we report results from several experiments which show the overhead of replication, and the cost of faul...
Abstract. As the scale of computing platforms becomes increasingly extreme, the requirements for app...
The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can...
We present in this paper a study on fault management in a grid middleware. The...
One of the topics of paramount importance in the development of Cluster and Grid middleware is the i...
Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with th...
High performance computing platforms such as Clusters, Grid and Desktop Grids ...
Abstract. The Grid community has made an important effort in developing middleware to provide differ...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
Resilience and fault tolerance are challenging tasks in the field of high performance computing (HPC...
Abstract. As recent research has demonstrated, it is becoming a necessity for large scale applicatio...
Exascale systems of the future are predicted to have mean time between failures (MTBF) of less than ...
This chapter describes the P2P-MPI project, a software framework aimed at the ...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
With the increasing number of processors in modern HPC (High Performance Computing) systems (65536 in...