To meet the needs of future scientific applications for storing and accessing large amounts of data efficiently, one needs to understand the limitations of current technologies and how they may cause system instability or unavailability. A number of factors can impact system availability, ranging from a facility-wide power outage to a single point of failure such as a network switch or a global file system. In addition, individual component failures can degrade the performance of a system. This paper analyzes both of these factors and their impacts on the computational and storage systems at NERSC. The component failure data presented in this report primarily focuses on disk drives in one of the computat...
Modern-day datacenters host hundreds of thousands of servers that coordinate tasks in order to deliv...
Resource failures and down times have become a growing concern for large-scale computational platfor...
NERSC's Global File system (NGF), accessible from all compute systems at NERSC, holds files and data...
From genomic sequencing to weather forecasting, high-performance computing systems (HPCs) have prof...
With petascale computers only a year or two away, there is a pressing need to anticipate and compensa...
Our project is a multi-institutional research effort that adopts interplay of RELIABILITY, AVAILABIL...
The objective of the ongoing activity is to develop a fusion-specific component failure database with d...
Failure prediction has long been known to be a challenging problem. With the evolving trend of technology...
As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failur...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
Designing highly dependable systems requires a good understanding of failure characteristics. Unfort...
Modern storage systems continue to increase in scale and complexity as they attempt to meet the inc...