Abstract: Exascale systems of the future are predicted to have a mean time between failures (MTBF) of less than one hour. Malleable applications, in which the number of processors used can be changed during execution, can exploit that malleability to better tolerate high failure rates. We present AdFT, an adaptive fault tolerance framework for long-running malleable applications that maximizes application performance in the presence of failures. The AdFT framework includes cost models for evaluating the benefits of various fault tolerance actions, including checkpointing, live-migration, and rescheduling, and runtime decisions for dynamically selecting the fault tolerance actions at different points of application executi...
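To make the idea of cost-based selection concrete, the following is a minimal sketch and not AdFT's actual cost model, which the abstract does not specify. It uses Young's well-known first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × MTBF), and a hypothetical helper that picks the action with the lowest estimated cost. The function names and all numeric values are assumptions chosen for illustration only.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the checkpoint interval, in seconds.

    This is a classical textbook formula, swapped in here for illustration;
    it is not taken from the AdFT framework.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def cheapest_action(estimated_cost_s):
    """Return the name of the action with the smallest estimated cost.

    `estimated_cost_s` maps an action name to a cost estimate in seconds.
    How such estimates are produced is left open (hypothetical inputs).
    """
    if not estimated_cost_s:
        raise ValueError("at least one action is required")
    return min(estimated_cost_s, key=estimated_cost_s.get)


if __name__ == "__main__":
    # Assumed values: a 60 s checkpoint on a system with a one-hour MTBF.
    interval = young_interval(checkpoint_cost_s=60.0, mtbf_s=3600.0)
    print(f"checkpoint roughly every {interval:.0f} s")

    # Hypothetical cost estimates for the three actions named in the abstract.
    costs = {"checkpointing": 120.0, "live-migration": 45.0, "rescheduling": 300.0}
    print("selected action:", cheapest_action(costs))
```

Because the interval grows only with the square root of the MTBF, a shorter MTBF forces noticeably more frequent checkpoints, which is one reason a runtime might prefer a different action at different points of an execution.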
In massively parallel systems (MPS), fault tolerance is indispensable to obtain proper completion o...
Supercomputers have seen an exponential increase in their size in the last two decades. Suc...
FT-GReLoSSS (FTG) is a C++/MPI framework to ease the development of fault-tole...
Exascale systems of the future are predicted to have mean time between failures (MTBF) of less than ...
Exascale systems of the future are predicted to have mean time between node failures (MTBF) of less ...
Many current approaches to software-implemented fault tolerance (SIFT) rely on process replication, ...
This paper compares the performance of different approaches to tolerate failur...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Enginee...
Cyber-physical systems frequently have to use massive redundancy to meet application requirements fo...