Supercomputers have played an essential role in the progress of science and engineering research. As the high-performance computing (HPC) community moves towards the next generation of systems, it faces several challenges, one of which is the reliability of HPC systems. Error rates are expected to increase significantly on exascale systems, to the point where traditional application-level checkpointing may no longer be a viable fault-tolerance mechanism. This has serious ramifications for a system's ability to guarantee the reliability and availability of its resources. It is becoming increasingly important to understand fault-to-failure propagation and to identify key areas of instrumentation in HPC systems for avoidance, detection, diagnos...
An increasingly larger percentage of computing capacity in today's large high-performance computing s...
The emergence of petascale systems and the promise of future exascale systems have reinvigorated the...
Following the growth of high-performance computing (HPC) systems in size and complexity, and the adv...
This thesis presents two unique sets of fault injections on mission-critical computer systems with t...
Large-scale computing platforms provide tremendous capabilities for scientific discovery. ...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
Transient hardware faults have become one of the major concerns affecting the reliability of modern ...
Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. ...
Resilience/fault-tolerance has become a key challenge for large-scale parallel...
Over the past few years, resilience has become a major issue for HPC systems, in particular in the pe...
HPC systems are widely used in industrial, economic, and scientific applications, and many of thes...
High Performance Computing (HPC) brings with it the promise of deeper insight into complex phenomen...