Abstract. Using rollback-recovery based fault tolerance (FT) techniques in applications executed on Multicore Clusters is still a challenge, because the overheads added depend on the applications' behavior and resource utilization. Many FT mechanisms have been developed in recent years, but analysis is lacking concerning how parallel applications are affected when applying such mechanisms. In this work we address the combination of process mapping and FT task mapping on multi-core environments. Our main goal is to determine the configuration of a pessimistic receiver-based message logging approach which generates the least disturbance to the parallel application. We propose to characterize the parallel application in combination with t...
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
Rollback techniques that use message logging and deterministic replay can be used in parallel system...
With the growing scale of HPC applications, there has been an increase in the number of interruption...
In High Performance Computing (HPC) the demand for more performance is satisfied by increasing the n...
Abstract—The predicted failure rates of future supercomputers loom the groundbreaking research larg...
Fault tolerance is becoming a major concern in HPC systems. The two traditiona...
Abstract—The era of petascale computing brought machines with hundreds of thousands of processors. T...
The predicted failure rates of future supercomputers loom the groundbreaking research large...
Abstract—A look at Exascale reveals a future with multicore supercomputers that will inexorably expe...
We present a unified fault-tolerance framework for task-parallel message-passing applications to mit...
Abstract—Computing systems will grow significantly larger in the near future to satisfy the needs of...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
An important set of challenges emerges as the High Performance Computing (HPC) community aims to rea...
The era of petascale computing brought machines with hundreds of thousands of processors. The next g...