Abstract — A major problem in managing large-scale datacenters is diagnosing and fixing machine failures. Most large datacenter deployments have a management infrastructure that can help diagnose failure causes, and manage assets that were fixed as part of the repair process. Previous studies identify only actual hardware replacements to calculate Annualized Failure Rate (AFR) and component reliability. In this paper, we show that service availability is significantly affected by soft failures and that this class of failures is becoming an important issue at large datacenters with minimum human intervention. Soft failures in the datacenter do not require actual hardware replacements, but still result in service downtime, and are equally imp...
Data center downtime causes business losses over a million dollars per hour. 24x7-hour data availabi...
The proliferation of distributed internet services has reaffirmed the need for reliable and high-per...
<p>Computing systems use dynamic random-access memory (DRAM) as main memory. As prior works have sho...
Modern day datacenters host hundreds of thousands of servers that coordinate tasks in order to deliv...
Despite the growing popularity of Solid State Disks (SSDs) in the datacenter, little is known about ...
Thesis (Ph.D.)--University of Washington, 2018Fast and accurate failure diagnosis remains a major ch...
The workloads running in the modern data centers of large scale Internet service providers (such asA...
Enterprise data centers are provisioned with conservative redundancies built into their power infras...
Distributed computing environments are increasingly deployed over geographically spanning data cente...
Distributed computing systems cover a broad range of computing infrastructures, which are heterogene...
The cloud data center is a complex system composed of power, cooling, and IT subsystems. The power s...
Designing highly dependable systems requires a good understanding of failure characteristics. Unfort...
Although rare in absolute terms, undetected CPU, memory, and disk errors occur often enough at datac...
With petascale computers only a year or two away there is a pressing need to anticipate and compensa...
Abstract. With petascale computers only a year or two away there is a pressing need to anticipate an...
Data center downtime causes business losses over a million dollars per hour. 24x7-hour data availabi...
The proliferation of distributed internet services has reaffirmed the need for reliable and high-per...
<p>Computing systems use dynamic random-access memory (DRAM) as main memory. As prior works have sho...
Modern day datacenters host hundreds of thousands of servers that coordinate tasks in order to deliv...
Despite the growing popularity of Solid State Disks (SSDs) in the datacenter, little is known about ...
Thesis (Ph.D.)--University of Washington, 2018Fast and accurate failure diagnosis remains a major ch...
The workloads running in the modern data centers of large scale Internet service providers (such asA...
Enterprise data centers are provisioned with conservative redundancies built into their power infras...
Distributed computing environments are increasingly deployed over geographically spanning data cente...
Distributed computing systems cover a broad range of computing infrastructures, which are heterogene...
The cloud data center is a complex system composed of power, cooling, and IT subsystems. The power s...
Designing highly dependable systems requires a good understanding of failure characteristics. Unfort...
Although rare in absolute terms, undetected CPU, memory, and disk errors occur often enough at datac...
With petascale computers only a year or two away there is a pressing need to anticipate and compensa...
Abstract. With petascale computers only a year or two away there is a pressing need to anticipate an...
Data center downtime causes business losses over a million dollars per hour. 24x7-hour data availabi...
The proliferation of distributed internet services has reaffirmed the need for reliable and high-per...
<p>Computing systems use dynamic random-access memory (DRAM) as main memory. As prior works have sho...