Resilient algorithms in high-performance computing are subject to rigorous non-functional constraints: resiliency must not significantly increase the runtime, memory footprint or I/O demands. We propose a task-based soft error detection scheme that relies on error criteria evaluated per task outcome. These criteria formalise how “dubious” an outcome is, i.e. how likely it is to contain an error. The whole simulation is replicated once, forming two teams of MPI ranks that share their task results, so that ideally each team handles only around half of the workload. If a task yields large error criteria values, i.e. is dubious, we compute the task redundantly and compare the outcomes. Whenever they disagree, the task result with a lower error likeliness i...
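The control flow described in this abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the two teams are simulated without MPI, and the names `run_teams`, `faulty_compute`, the alternating task ownership, the threshold and the toy criterion are all hypothetical. It also assumes that, on disagreement, the outcome with the lower criterion value is kept, which the truncated final sentence suggests but does not state in full.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Outcome:
    value: float
    criterion: float  # hypothetical error criterion: larger means more dubious


def run_teams(
    task_ids: List[int],
    compute: Callable[[int, int], float],  # hypothetical: (team, task_id) -> result
    criterion: Callable[[float], float],
    threshold: float,
    tol: float = 1e-12,
) -> Tuple[Dict[int, float], int, List[int]]:
    """Return accepted results, the number of task evaluations, and the ids
    of tasks whose two outcomes disagreed."""
    accepted: Dict[int, float] = {}
    evaluations = 0
    disagreements: List[int] = []

    for task_id in task_ids:
        owner = task_id % 2  # assumption: tasks alternate between team 0 and team 1
        other = 1 - owner

        value = compute(owner, task_id)
        first = Outcome(value, criterion(value))
        evaluations += 1

        if first.criterion <= threshold:
            # Not dubious: the owner shares the outcome, the other team skips the task.
            accepted[task_id] = first.value
            continue

        # Dubious: the other team computes the task redundantly.
        value = compute(other, task_id)
        second = Outcome(value, criterion(value))
        evaluations += 1

        if abs(first.value - second.value) > tol:
            disagreements.append(task_id)
            # Assumption: keep the outcome with the lower criterion value.
            best = first if first.criterion <= second.criterion else second
            accepted[task_id] = best.value
        else:
            accepted[task_id] = first.value

    return accepted, evaluations, disagreements


def faulty_compute(team: int, task_id: int) -> float:
    """Toy task: the correct result is float(task_id); team 0 corrupts task 4."""
    if team == 0 and task_id == 4:
        return 1.0e6  # injected soft error
    return float(task_id)


results, evaluations, disagreements = run_teams(
    task_ids=list(range(6)),
    compute=faulty_compute,
    criterion=abs,  # toy criterion: magnitude of the result
    threshold=100.0,
)
```

In this toy run only the corrupted task is computed twice, so six tasks cost seven evaluations, compared with twelve if both teams computed every task.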
In the modern era of computing, processors are increasingly susceptible to soft errors. Current solu...
Technology scaling has led to growing concerns about reliability in microprocessors. Currently, faul...
This work is based on the seminar titled “Resiliency in Numerical Algorithm De...
ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors. (Under the direc...
Many methods are available to detect silent errors in high-performance computi...
Abstract: Many methods are available to detect silent errors in high-performance computing (HPC) app...
Traditionally, fault tolerance researchers have made very strict assumptions about program correctne...
According to Moore’s law, technology scaling is continuously providing smaller and faster devices. T...
Soft errors are faults that are not caused by defective hardware; rather, they are induced due to no...
This work focuses on resilience techniques at extreme scale. Many papers deal ...
Current scaling trends in transistor technology, in pursuit of larger component counts a...
Hardware errors are on the rise with reducing chip sizes, and power constraints have necessitated th...
The coming exascale era is a great opportunity for high performance computing (HPC) applications. Ho...
This thesis focuses on resilience for high performance applications that execute on large scale plat...