As technology scaling reaches nanometre scales, the error rate due to variations in temperature and voltage, single event effects and component degradation increases, making components less reliable. In order to ensure a system continues to function correctly while facing known reliability issues, it is imperative that the system should have the means to detect the occurrence of errors due to the presence of faults. A system that behaves normally (no error detected in the system) exhibits a profile, and any deviations from this profile indicate that there is an anomaly in the system. In this paper, we propose to use hardware performance counters (HPCs) to measure events that occur during the execution of the program. We explore the various counters ava...
The analysis and correct categorisation of software performance anomalies is a major challenge in cu...
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to in-the-field...
Modern processors incorporate several performance monitoring units, which can be used to count event...
Embedded systems suffer from reliability issues such as variations in temperature and voltage, singl...
Nowadays, Graphics Processing Units (GPUs) have gained importance in several domains where a high co...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
Following the growth of high performance computing systems (HPC) in size and complexity, and the adv...
Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomp...
Raw data of hardware performance counters from benchmarks used in developing the early detection and...
Energy providers are massively deploying devices to manage distributed resourc...
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to in-the-field...
A large percentage of computing capacity in today's large high-performance computing systems is waste...
Improvements in performance and energy efficiency often require deep understanding of the complex in...
Software performance anomaly detection is a major challenge in complex industrial cyber-physical sys...
Performance failures are commonplace in most computing environments; without system monitoring they ...