Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost, recall (fraction of all errors that are actually detected, i.e., false negatives), and precision (fraction of true errors amongst all detected errors, i.e., false positives). The main contribution of this paper is to characterize the optimal computing pattern for an application: which detector(s) to use, how many detectors of each type to use, together with the length of the work segment that precedes each of them. We first prove that detectors with imperfect precisions offer limited usefulness. Then we focus on detectors with perfect precision, and we conduct a comprehensive complexity analysis of this optimization proble...
As high performance computing (HPC) systems continue to grow, their fault rate increases. Applicatio...
Current high-performance computing (HPC) systems are comprised of thousands of CPU core...
The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large...
Many methods are available to detect silent errors in high-performance computing (HPC) applications. ...
Many methods are available to detect silent errors in high-performance computing (HPC) applications. ...
Silent errors, or silent data corruptions, constitute a major threat on very large scale platforms. ...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
This paper provides a model and an analytical study of replication as a technique to detect and corre...
Many methods are available to detect silent errors in high-performance computi...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...
In this thesis, spatial multiplexing-MIMO communication schemes with OFDM modulation are considered ...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
Projections and measurements of error rates in near-exascale and exascale systems suggest a dramati...
The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large...
High performance computing applications must be resilient to faults, which are common occurrences es...