The primary mechanism for overcoming faults in modern storage systems is to introduce redundancy in the form of replication and error-correcting codes. The costs of such redundancy in hardware, system availability, and overall complexity can be substantial, depending on the number and pattern of faults that are handled. This dissertation describes and analyzes, via simulation, a system that uses disk failure avoidance, driven by adaptive heuristics that anticipate such failures, to reduce the need for costly redundancy. While a number of predictive factors can be used, this research focuses on the three leading candidates: SMART errors, age, and vintage. This approach can predict where near-term disk failures are more likely to occur, ...
The objective of this research is to develop design methodologies for scalable and reliable memory s...
Distributed storage systems are constrained by the finite speed of propagation of information. The C...
Failure is inevitable: disks fail, hosts crash, networks partition, applications stop. Consequently...
Modern storage systems continue to increase in scale and complexity as they attempt to meet the inc...
During the past decade, advances in processor and memory technology have given rise to increases in ...
Today's most reliable data storage systems are made of redundant arrays of inexpensive disks (RAID)....
Systems suffer component failure at sometimes unpredictable rates. Storage systems are no exception...
As we look toward exascale it is clear that high-capacity HPC storage systems will incorporate the l...
Technology scaling has led to growing concerns about reliability in microprocessors. Currently, faul...
When building storage systems that aim to simultaneously provide robustness, scalability, and ef...
This research addresses design of a reliable computer from unreliable device technologies. A system ...
With the explosive increase in the amount of data being generated by various applications, large-sca...
Reliability is a major concern in the design of large disk arrays. In this paper, we examine...
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer S...
There exists a wide variety of applications in which data availability must be continuous, that is, ...