With the widespread adoption of machine learning (ML) applications in HPC domains, the reliability of ML is growing in importance. Specifically, ML systems have been found to be vulnerable to transient hardware faults, which are growing in frequency and can result in critical failures (e.g., causing an autonomous vehicle to miss an obstacle in its path). There is therefore a compelling need to understand the error resilience of ML systems and to protect them from transient faults. In this thesis, we first aim to understand the error resilience of ML systems in the presence of transient faults. Traditional solutions use random fault injection (FI), which is poorly suited to pinpointing the vulnerable regions of a system. The...
In recent years, Machine Learning (ML) has become widely used in software systems: it is applie...
Some of the present day applications run on computer platforms with large and inexpensive memories, ...
The great quest for adopting AI-based computation for safety-/mission-critical applications motivate...
As Machine Learning (ML) has seen increasing adoption in safety-critical domains (e.g., autonomous v...
HPC systems are widely used in industrial, economic, and scientific applications, and many of thes...
The drive for automation and constant monitoring has led to rapid development in the field of Machin...
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at ...
Supercomputers have played an essential role in the progress of science and engineering research. As...
Machine learning (ML) provides us with numerous opportunities, allowing ML systems to adapt to new s...
Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructu...
Transient hardware faults have become one of the major concerns affecting the reliability of modern ...
According to Moore’s law, technology scaling is continuously providing smaller and faster devices. T...
Machine learning models have many applications, being used for example in pattern analysis, image cl...
For many types of integrated circuits, accepting larger failure rates in compu...
Bit flips are known to be a source of strange system behavior, failures, and crashes. They can cause...