This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in M...
Integrated electronic systems are increasingly used in a wide range of applications and environmen...
Fault injection (FI) is an experimental technique to assess the robustness of software by deliberate...
Detection, diagnosis and mitigation of performance problems in today's large-scale distributed an...
Computing systems are becoming increasingly complex with nodes consisting of a combination of multi-...
Supercomputers have played an essential role in the progress of science and engineering research. As...
With processor clock speeds having stagnated, parallel computing architectures have achieved a break...
This thesis addresses three important steps in the selection of error detection mechanisms for micro...
Hardware errors are projected to increase in modern computer systems due to shrinking feature sizes ...
Heterogeneous parallel architectures like those comprised of CPUs and GPUs are a tantalizing compute...
Transient hardware faults have become one of the major concerns affecting the reliability of modern ...
Data races are notorious bugs. They introduce non-determinism in program behavior, complicate progr...
As chip densities and clock rates increase, processors are becoming more susceptible to transient fa...
Graphics Processing Units (GPUs) have become a key technology for accelerating node performance in s...