Execution of MPI applications on cluster and Grid deployments suffers from node and network failures, which motivates the use of fault-tolerant MPI implementations. Two categories of techniques have been introduced to make these systems fault tolerant: checkpoint-based techniques and log-based recovery protocols. Sender-based pessimistic logging, which falls into the second category, suffers from the large volume of message payloads that must be kept in volatile memory. In this paper we present Coordinated Checkpoint from Message Payload (CCMP) to reduce this overhead. The proposed method was evaluated on MPICH-V2, a public-domain platform implementing pessimistic logging with uncoordinated chec...
Fault tolerance is becoming a major concern in HPC systems. The two traditiona...
With the growing scale of HPC applications, there has been an increase in the number of interruption...
A long-term trend in high-performance computing is the increasing number of no...
MPI is one of the most adopted programming models for Large Cluste...
With the growing scale of high performance computing platforms, fault toleranc...
As reported by many recent studies, the mean time between failures of future...
To execute MPI applications reliably, fault tolerance mechanisms are needed. M...
Fault tolerance in MPI has become a major issue in the HPC community. Several appr...
Grid computing pools more computing resources working in a calculation or...