Handling faults is a growing concern in HPC: a greater variety of faults, higher error rates, longer detection intervals, and silent faults are expected in the future. In exascale systems, errors are projected to occur several times a day and to propagate, with effects ranging from process crashes to corrupted results, including undetected errors in applications that keep running. In this article, we analyze SMCV, a methodology for transient fault detection in MPI applications. The methodology is based on software replication and assumes that data corruption becomes apparent as differing messages between replicas. SMCV makes it possible to obtain reliable executions with correct results, or, at least, leadi...
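The detection idea stated in the abstract (replicas run the same computation, and corruption shows up as differing message contents) can be illustrated with a minimal sketch. This is not the SMCV implementation: the names `validate_message`, `send_validated`, `flip_bit`, and `TransientFaultDetected` are hypothetical, the two replicas are simulated by calling the same function twice in one process, and no MPI library is used.

```python
class TransientFaultDetected(Exception):
    """Raised when the two replicas disagree on an outgoing message (hypothetical name)."""


def validate_message(primary: bytes, replica: bytes) -> bytes:
    """Return the message only if both replicas produced identical contents."""
    if primary != replica:
        raise TransientFaultDetected("replicas produced different message contents")
    return primary


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a transient fault by flipping a single bit of the buffer."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def send_validated(compute, inject_fault: bool = False) -> bytes:
    """Run the computation twice (assumption: stands in for two replicas),
    optionally corrupt one result, and compare before 'sending'."""
    primary = compute()
    replica = compute()
    if inject_fault:
        replica = flip_bit(replica, 3)
    # In a real MPI setting the validated buffer would be handed to the send call here.
    return validate_message(primary, replica)


if __name__ == "__main__":
    print(send_validated(lambda: b"partial result"))
    try:
        send_validated(lambda: b"partial result", inject_fault=True)
    except TransientFaultDetected as err:
        print("detected:", err)
```

Comparing at message boundaries, rather than after every instruction, is the design choice the abstract points to: a fault that never reaches a message is not checked at that point, while one that does is caught before it spreads to other processes.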
As high performance computing (HPC) systems continue to grow, their fault rate increases. Applicatio...
Transient hardware faults have become one of the major concerns affecting the reliability of modern ...
Transient hardware faults have become one of the major concerns affecting the reliability of modern ...
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection...
Transient faults are becoming a critical concern among current trends of design of general-purpose m...
Transient faults are becoming a critical concern among current trends of design of general-purpose mu...
The challenge of improving the performance of current processors is achieved by increasing the integ...
The challenge of improving the performance of current processors is achieved by increasing the integ...
The challenge of improving the performance of current processors is achieved by increasing the integ...
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and sile...
Resiliency of exascale systems has quickly become an important concern for the scientific community....
Transient faults are becoming a critical concern among current trends of design of general-purpose mu...
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent...
This chapter describes a unified framework for the detection and correction of...
The challenge of improving the performance of current processors is achieved by increasing the integ...