The move towards exascale supercomputers requires new fault tolerance solutions. For parallel message-passing applications, existing rollback-recovery protocols are not well suited. To deal with very large-scale applications and high failure rates, a protocol should confine the consequences of a failure to a small subset of the processes while providing good failure-free performance and logging a limited amount of data, especially in memory. To fulfill these needs, we propose HydEE, a hierarchical rollback-recovery protocol that combines coordinated checkpointing and message logging. HydEE leverages the send-determinism of scientific parallel applications to tolerate multiple failures without rely...
The concept of backward recovery is now well established as a means of restoring a consistent state ...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
High performance computing will probably reach exascale in this decade. At thi...
Processor failures in post-petascale settings are common occurrences. The traditional fault-toleranc...
The predicted failure rates of future supercomputers loom the groundbreaking research large...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
Processor failures in post-petascale parallel computing platforms are common o...
The high failure rate expected for future supercomputers requires the design o...
Fault-tolerance protocols play an important role in today's long-runtime scienti...
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/19...
The predicted failure rates of future supercomputers loom the groundbreaking research larg...
With the evolution of parallel computers, the need for fault tolerance protocols is becoming increas...