Handling faults is a growing concern in HPC: a greater variety of faults, higher error rates, longer detection intervals, and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day and will propagate, with effects ranging from process crashes to corrupted results, including undetected errors in applications that are still running. In this article, we analyze a methodology for transient fault detection (called SMCV) for MPI applications. The methodology is based on software replication, and it assumes that data corruption becomes apparent when replicas produce different messages. SMCV makes it possible to obtain reliable executions with correct results, or, at least, leadi...
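The replication idea in the abstract above can be illustrated with a minimal sketch. It is not the SMCV implementation and it does not use MPI: the names `validated_send`, `flip_bit`, and `TransientFaultDetected` are hypothetical, the "network" is a plain Python list, and the only assumption taken from the abstract is that a corrupted replica produces a different outgoing message.

```python
class TransientFaultDetected(Exception):
    """Raised when two replicas disagree on an outgoing message (hypothetical name)."""


def validated_send(primary: bytes, replica: bytes, outbox: list) -> None:
    """Compare the message built by each replica before it leaves the process.

    Assumption for this sketch: both replicas ran the same computation, so any
    difference between their messages signals data corruption in one of them.
    """
    if primary != replica:
        raise TransientFaultDetected("replicas produced different messages")
    # Only one copy is sent; the list stands in for the real communication layer.
    outbox.append(primary)


def flip_bit(payload: bytes, bit: int) -> bytes:
    """Simulate a transient fault by flipping a single bit of the payload."""
    corrupted = bytearray(payload)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    outbox = []
    message = b"partial result"
    validated_send(message, message, outbox)  # replicas agree, message is sent
    try:
        validated_send(message, flip_bit(message, 5), outbox)
    except TransientFaultDetected as err:
        print("fault detected:", err)
```

In this sketch a mismatch stops the send, so the corrupted message never reaches another process; what happens after detection (for example, stopping or restarting the application) is left out.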
The challenge of improving the performance of current processors is achieved by increasing the integ...
This chapter describes a unified framework for the detection and correction of...
This report describes a unified framework for the detection and correction of silent errors, which co...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
Transient faults are becoming a critical concern among current trends of design of general-purpose m...
Transient faults are becoming a critical concern among current trends of design of general-purpose mu...
The challenge of improving the performance of current processors is achieved by increasing the integ...
The challenge of improving the performance of current processors is achieved by increasing the integ...
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and sile...
The challenge of improving the performance of current processors is achieved by increasing the integ...
Resiliency of exascale systems has quickly become an important concern for the scientific community....
Transient faults are becoming a critical concern among current trends of design of general-purpose mu...
As high performance computing (HPC) systems continue to grow, their fault rate increases. Applicatio...
This chapter describes a unified framework for the detection and correction of...
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent...
The challenge of improving the performance of current processors is achieved by increasing the integ...
This chapter describes a unified framework for the detection and correction of...
This report describes a unified framework for the detection and correction of silent errors, which co...