Checkpointing is today's common means of dealing with transient failures in supercomputers. However, the effectiveness of checkpointing and recovery protocols under the assumption that failures may happen during their operation is not well understood. We present an evaluation of checkpointing and recovery based on Sender-Based Message Logging (SBML) protocols. The evaluation relies on a model derived from an extensive field data campaign performed on the SCOPE supercomputer at the University of Naples. A comprehensive model is built to evaluate the reliability, scalability, and performance of SBML. The proposed model takes into account failures that occur during checkpointing and recovery. Results provide insights on the limit of the num...
Fault tolerance is becoming a major concern in HPC systems. The two traditiona...
The predicted failure rates of future supercomputers loom the groundbreaking research larg...
... a process is logged on stable storage [5], and each process is occasionally checkpoint...
The era of petascale computing brought machines with hundreds of thousands of processors. The next g...
A look at Exascale reveals a future with multicore supercomputers that will inexorably expe...
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
In the past twenty years, there has been a wealth of theoretical research on minimizing the expected...
Message logging and checkpointing can provide fault tolerance in distributed systems in which all pr...
This survey covers rollback-recovery techniques that do not require special language constructs. In ...
Computing systems will grow significantly larger in the near future to satisfy the needs of...
In this paper, we present a unified model for several well-known checkpoint/re...
This paper introduces an effective communication-induced checkpointing protocol using message loggin...