Resilience is a continuing concern for extreme-scale scientific applications. Tolerating ever-increasing hardware fault rates demands a scalable end-to-end resilience scheme. The fundamental issue with current system-wide techniques, such as checkpoint-restart, is their one-size-fits-all approach, which recovers from local failures globally. The challenges of supporting efficient resilience grow at scale with the trend of adopting accelerators. Exploiting resilience tailored to an application can offer a potential breakthrough that enables efficient localized recovery, because an individual node maintains a low failure rate even at scale. I propose a framework realizing Containment Domains (CDs) that addresses the resilience challenges for future-sc...
Projections and reports about exascale failure modes conclude that we need to protect numerical simu...
General-purpose GPU (GPGPU) computing has produced the fastest running supercomputers in th...
With memories continuing to dominate the area, power, cost and performance of a design, there is a c...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
Continued scaling of semiconductor technology has made modern processors rely on large design margin...
Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. ...
This paper describes and evaluates a scalable and efficient resilience scheme based on the concept o...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
are those of the authors and should not be interpreted as representing the official policies, either...
High Performance Computing (HPC) brings with it the promise of deeper insight into complex phenomen...
Over the past few years, resilience has become a major issue for HPC systems, in particular in the pe...
Resilience is a major roadblock for HPC executions on future exascale systems. These systems will ty...
To enable future scientific breakthroughs and discoveries, the next generation of scientific applica...
Supercomputers have played an essential role in the progress of science and engineering research. As...
Ever-growing performance of supercomputers nowadays brings demanding requirements of energy efficien...