With the growing scale of HPC applications, there has been an increase in the number of interruptions as a consequence of hardware failures. The remarkable decrease in Mean Time Between Failures (MTBF) in current systems encourages research into suitable fault tolerance solutions. Message logging combined with uncoordinated checkpointing forms a scalable rollback-recovery solution. However, message logging techniques are usually responsible for most of the overhead during failure-free executions. Taking this into consideration, this paper proposes the Hybrid Message Pessimistic Logging (HMPL), which focuses on combining the fast recovery feature of pessimistic receiver-based message logging with the low failure-free overhead introduced ...
To execute MPI applications reliably, fault tolerance mechanisms are needed. M...
Message logging is a popular technique for building systems that can tolerate process crashes and tr...
The predicted failure rates of future supercomputers loom the groundbreaking research larg...
With the growing scale of High Performance Computing applications comes an increase in the n...
Execution of MPI applications on Clusters and Grid deployments suffers from node and network failure...
With the growing scale of high performance computing platforms, fault toleranc...
The era of petascale computing brought machines with hundreds of thousands of processors. T...
Fault tolerance is becoming a major concern in HPC systems. The two traditiona...
The era of petascale computing brought machines with hundreds of thousands of processors. The next g...
Computing systems will grow significantly larger in the near future to satisfy the needs of...
Grid computing mutualizes more computing resources working in a calculation or...
An important set of challenges emerge as the High Performance Computing (HPC) community aims to rea...
Fault tolerance in MPI becomes a main issue in the HPC community. Several appr...