This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in M...
With the advances of very large scale integration (VLSI) technology, the feature size has been shrin...
Integrated electronic systems are more and more used in a wide number of applications and environmen...
In this paper, we explore the implementation of fault simulation on a Graphics Processing Unit (GPU)...
Computing systems are becoming increasingly complex with nodes consisting of a combination of multi-...
As chip densities and clock rates increase, processors are becoming more susceptible to transient fa...
This thesis addresses three important steps in the selection of error detection mechanisms for micro...
Transient hardware faults have become one of the major concerns affecting the reliability of modern ...
With processor clock speeds having stagnated, parallel computing architectures have achieved a break...
In the research reported in this paper, transient faults were injected in the nodes and in the commu...
Hardware errors are projected to increase in modern computer systems due to shrinking feature sizes ...
Detection, diagnosis and mitigation of performance problems in today's large-scale distributed an...
Traditional reliability-related models for fault-tolerant systems are used to predict system reliabi...
Fault injection (FI) is an experimental technique to assess the robustness of software by deliberate...