Fault-tolerance is a major challenge for many current and future extreme-scale systems, with many studies showing it to be the key limiter to application scalability. While there are a number of studies investigating the performance of various resilience mechanisms, these are typically limited to scales orders of magnitude smaller than expected for next-generation systems and simple benchmark problems. In this paper we show how, with very minor changes, a previously published and validated simulation framework for investigating application performance of OS noise can be used to simulate the overheads of various resilience mechanisms at scale. Using this framework, we compare the failure-free performance of this simulator against an analyt...
This is ongoing work that aims to derive metrics that represent fault resilience in ...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in...
This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale...
Monolithic applications are gradually getting replaced by systems built after the emerging microserv...
Resiliency is becoming an important service attribute for large scale distributed systems and networ...
Maintaining performance in a faulty distributed computing environment is a major challenge in the de...
Research on resilient systems extends classical system analysis, modeling and simulation approaches....
Fast evolution of computing systems is still a challenge today, but it is beco...
This paper presents a framework to assess and improve the resilience of a production system by ident...
Deriving fault-tolerant schedulability resilience for real-time systems has been a challen...
Resilience is a critical problem for extreme scale numerical simulations. The ...
In traditional distributed simulation schemes, the entire simulation needs to be restarted if any of the...
Despite the central importance of crew safety in designing and operating a life support system, the ...
A simulation-based approach to measuring the fault resilience of real-time systems is presented. Simu...