Exascale systems of the future are predicted to have a mean time between failures (MTBF) of less than one hour. Malleable applications, in which the number of processors used can change during execution, can exploit that malleability to better tolerate high failure rates. We present AdFT, an adaptive fault tolerance framework for long-running malleable applications that maximizes application performance in the presence of failures. The AdFT framework includes cost models for evaluating the benefits of various fault tolerance actions, including checkpointing, live migration, and rescheduling, and runtime decisions for dynamically selecting the fault tolerance actions at different points of application execution to ma...
This paper compares the performance of different approaches to tolerate failur...
The current approach to resilience for large high-performance computing (HPC) machines is based on g...
As the scale of high performance computing (HPC) continues to grow, application fault resil...
Exascale systems of the future are predicted to have mean time between node failures (MTBF) of less ...
Many current approaches to software-implemented fault tolerance (SIFT) rely on process replication, ...
Fault-tolerant programs are typically not only difficult to implement but also incur extra...
Today’s software engineering and application development trend is to take advantage of reusable soft...
Application robustness becomes a major concern with the continued scaling of high perform...