Our project is a multi-institutional research effort that draws on the interplay of reliability, availability, and serviceability (RAS) to address resilience issues in high-end scientific computing on the next generation of supercomputers. Results fall into the following tracks: failure prediction in large-scale HPC systems; reliability issues and mitigation techniques, including in GPGPU-based HPC systems; and HPC resilience runtimes and tools.
This paper presents various aspects of reliability, availability and serviceability (RAS) systems as...
With petascale computers only a year or two away there is a pressing need to anticipate and compensa...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
During the last several years, our teams at Oak Ridge National Laboratory, Louisiana Tech University...
Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. ...
Over the past few years resilience has become a major issue for HPC systems, in particular in the pe...
The emergence of petascale systems and the promise of future exascale systems have reinvigorated the...
Resilience is a major roadblock for HPC executions on future exascale systems. These systems will ty...
Supercomputers have played an essential role in the progress of science and engineering research. As...
HPC systems are widely used in industrial, economic, and scientific applications, and many of thes...
GPGPUs are used increasingly in several domains, from gaming to different kinds of compu...
Projections and reports about exascale failure modes conclude that we need to protect numerical simu...
Following the growth of high performance computing systems (HPC) in size and complexity, and the adv...