International audience. Grid computing pools computing resources that work together on a calculation or a common task. Increasing the number of components in the system also increases the number of faults. These failures cause running applications to lose many computation cycles. It is therefore essential to tolerate faults so that the computation can continue to execute and finish despite failures, while maintaining maximum performance. One advantage of coordinated checkpointing is its very low overhead as long as the execution stays fault free. In contrast, because uncoordinated checkpointing must be complemented by a message-logging protocol, it adds a significant penalty for all m...
International audience. Fault tolerance is becoming a major concern in HPC systems. The two traditiona...
This paper presents a new checkpointing algorithm for systems using reliable communication channels....
This paper presents a new checkpointing coordination scheme which utilizes the communication pattern...
Execution of MPI applications on Clusters and Grid deployments suffers from node and network failure...
International audience. Grid infrastructure is a large set of nodes geographically distributed and con...
International audience. Fault tolerance is a key issue in grid systems. But few works have been done t...
{bouteill,lemarini,gk,fci}@lri.fr. MPI is one of the most adopted programming models for Large Cluste...
International audience. Grid infrastructure is a large set of nodes geographically distributed and con...
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
Abstract—Computing systems will grow significantly larger in the near future to satisfy the needs of...
Abstract—The era of petascale computing brought machines with hundreds of thousands of processors. T...
With the growing scale of HPC applications, there has been an increase in the number of interruption...
Abstract. Based on our current expectation for the exascale systems, composed of hundred of thousand...
With the growing scale of HPC applications, there has been an increase in the number of interruption...
National audience. In order to execute without modification Message Passing distributed applications on...