This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’, held March 1–6, 2020, at Schloss Dagstuhl and attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of an enormous amount of resources and expense. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume about a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers fr...
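The energy and cost figures quoted in the abstract can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: the function and constant names are hypothetical, the operation count is read as 10^23, and the electricity price is not stated in the abstract but merely inferred from the quoted 100k Euro.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# All names here are illustrative; only the input numbers come from the text.

POWER_MW = 20.0             # system power draw quoted in the abstract
RUNTIME_H = 48.0            # run duration quoted in the abstract
TOTAL_FLOP = 1e23           # operation count, read as 10^23 (assumption)
ENERGY_COST_EUR = 100_000   # approximate energy cost quoted in the abstract


def energy_kwh(power_mw: float, hours: float) -> float:
    """Energy in kWh for a constant power draw (1 MW = 1000 kW)."""
    return power_mw * 1000.0 * hours


def implied_price_eur_per_kwh(cost_eur: float, kwh: float) -> float:
    """Electricity price implied by a total cost and an energy amount."""
    return cost_eur / kwh


def sustained_flop_per_s(total_flop: float, hours: float) -> float:
    """Average rate needed to finish total_flop within the given hours."""
    return total_flop / (hours * 3600.0)


if __name__ == "__main__":
    kwh = energy_kwh(POWER_MW, RUNTIME_H)
    price = implied_price_eur_per_kwh(ENERGY_COST_EUR, kwh)
    rate = sustained_flop_per_s(TOTAL_FLOP, RUNTIME_H)
    print(f"energy: {kwh:,.0f} kWh")
    print(f"implied price: {price:.3f} Euro/kWh")
    print(f"sustained rate: {rate:.3e} flop/s")
```

Under these assumptions the run uses 960,000 kWh, which matches the abstract's "a million kWh" after rounding, and 10^23 operations in 48 h corresponds to a sustained rate of roughly 0.58 exaflop/s, consistent with an exascale system.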
To enable future scientific breakthroughs and discoveries, the next generation of scientific applica...
This thesis deals with two issues for future Exascale platforms, namely resilience and energy. We ad...
Workflows systems are considered here to support large-scale multiphysics simu...
This work is based on the seminar titled “Resiliency in Numerical Algorithm De...
We propose a novel, minimally intrusive approach to adding fault tolerance to existing complex scien...
Projections and reports about exascale failure modes conclude that we need to protect numerical simu...
Resilience is a critical problem for extreme scale numerical simulations. The ...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
Resilient algorithms in high-performance computing are subject to rigorous non-functional constrain...
We present here a report produced by a workshop on ‘Addressing failures in exascale computing’ held ...
Ultrascale computing is a new computing paradigm that comes naturally from the necessity of computin...
In an era where we cannot afford to checkpoint frequently, replication is a generic way forward to ...
The work described in this paper aims at effective computation resilience for complex simulations in...