Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hierarchical rollback-recovery protocols based on the combination of coordinated checkpointing and message logging are an alternative. These partial message logging protocols are based on process clustering: only messages between clusters are logged to limit the consequence of a failure to one cluster. These protocols would work ef...
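The clustering idea in this abstract can be illustrated with a minimal sketch. It is not the authors' protocol: the function names, the `cluster_of` mapping, and the sample message sizes are all hypothetical, chosen only to show that logging is restricted to inter-cluster messages and that a failure rolls back a single cluster.

```python
# Hypothetical illustration of partial message logging with process clustering.
# cluster_of maps a process rank to a cluster label (assumed, not from the paper).

def should_log(sender, receiver, cluster_of):
    """A message is logged only if it crosses a cluster boundary."""
    return cluster_of[sender] != cluster_of[receiver]

def processes_to_roll_back(failed, cluster_of):
    """Only the processes in the failed process's cluster roll back."""
    target = cluster_of[failed]
    return sorted(p for p, c in cluster_of.items() if c == target)

def logged_fraction(messages, cluster_of):
    """Fraction of the communicated bytes that has to be logged."""
    total = sum(size for _, _, size in messages)
    logged = sum(size for s, r, size in messages if should_log(s, r, cluster_of))
    return logged / total if total else 0.0

# Four processes in two clusters; messages are (sender, receiver, bytes).
cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}
messages = [(0, 1, 100), (1, 2, 50), (2, 3, 100), (3, 0, 50)]

rollback_set = processes_to_roll_back(2, cluster_of)
fraction = logged_fraction(messages, cluster_of)
```

In this toy setting, a clustering that keeps the heaviest communication inside clusters lowers the logged fraction, while smaller clusters shrink the set of processes that roll back; the two goals pull in opposite directions.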
In High Performance Computing (HPC) the demand for more performance is satisfied by increasing the n...
As reported by many recent studies, the mean time between failures of future...
The era of petascale computing brought machines with hundreds of thousands of processors. The next g...
With the growing scale of HPC applications, there has been an increase in the number of interruption...
The high failure rate expected for future supercomputers requires the design of new fault tolerant s...