As technology feature size continues to shrink, two challenging problems arise in the design of computer systems. One is hardware unreliability, due to the increasing likelihood of transient hardware faults caused by high-energy particles and temperature hot spots. The other is variability in the semiconductor manufacturing process, which manifests itself as large variation in gate length, threshold voltage, and other parameters within a wafer and even within a die. This variability ultimately affects the frequency and leakage power dissipation of a chip. In the first part of this thesis, we study the problem of handling I/O in memory-based checkpointing systems. The increasing demand for reliable computers has led to proposals for...
Recent advances in deep submicron (DSM) technology have imposed an adverse impact on the long-term l...
As the number of CPU cores in high-performance computing platforms continues to grow, the availabili...
It is getting increasingly difficult to verify processors and guarantee subsequent reliable operatio...
As technology feature size continues to shrink, we see two challenging problems in designing compute...
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory mu...
Memory system design is important for providing high reliability and availability. This dissertation...
The checkpoint and rollback recovery techniques enable a system to survive failures by periodically ...
System reliability is becoming a significant concern as technology continues to shrink. This is beca...
As transistor technology scales ever further, hardware reliability is becoming harder to manage. Th...
In recent years, circuit reliability in modern high-performance processors has become increasingly i...
In recent years, circuit reliability in modern high-performance processors has become increasingly i...
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in...
Reliability is a fundamental challenge for current and future microprocessors with advanced nanoscal...
We present a method to recover from failures caused by software bugs. Our method relies on two key ...
VLSI systems in the nanometer regime suffer from high defect rates and large parametric var...