As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are one of the major sources of undetected corruption in the execution results of HPC applications. In this work, we explore a low-memory-overhead SDC detector that leverages epsilon-insensitive support vector machine regression to detect SDCs in HPC applications that can be characterized by an impact error bound. The key contributions are threefold. (1) Our design incorporates spatial features (i.e., neighbouring data values for each data point in a snapshot) into the training data, such that l...
Improving reliability and performance is of utmost importance for any system. This thesis prese...
Proposed exascale systems will present a number of considerable resiliency challenges. In...
In this work we propose partial task replication and checkpointing for task-parallel HPC applicatio...
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems ...
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems ...
Silent Data Corruption (SDC) is a serious reliability issue in many domains, including embedded syst...
Many methods are available to detect silent errors in high-performance computing (HPC) app...
This report describes a unified framework for the detection and correction of silent errors, which co...
Hardware errors are on the rise with reducing chip sizes, and power constraints have necessitated th...
Data reduction techniques have been widely demanded and used by large-scale high performance computi...
Mean Time Between Failures (MTBF), now calculated in days or hours, is expected to drop to minutes o...
According to Moore’s law, technology scaling is continuously providing smaller and faster devices. T...
Chip manufacturers and hyperscalers are becoming increasingly aware of the problem posed by...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
Many methods are available to detect silent errors in high-performance computi...