Abstract. Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft errors, or bit flips, are expected to increase greatly due to the increased memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for addressing the expected soft error rate. As a result, additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that once tuned,...
The key objective of database systems is to reliably manage data, whereby high query throughput and ...
Supercomputers offer new opportunities for scientific computing as they grow in size. However, their...
As DRAM technology continues to evolve towards smaller feature sizes and increased densities, faults...
In this work we propose partial task replication and check-pointing for task-parallel HPC applicatio...
The coming exascale era is a great opportunity for high performance computing (HPC) applications. Ho...
Transient hardware faults have become one of the major concerns affecting the reliability of modern ...
Abstract: Many methods are available to detect silent errors in high-performance computing (HPC) app...
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This conce...
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Perfo...
DRAM scaling has been the prime driver for increasing the capacity of main memory system over the p...
According to Moore’s law, technology scaling is continuously providing smaller and faster devices. T...
Hardware errors are on the rise with reducing chip sizes, and power constraints have necessitated th...
Abstract: Nowadays, memory devices are susceptible to Single Event Upsets (SEU) which is one o...
The reliability of the memory subsystem is fast becoming a concern in computer architecture and system d...
This chapter describes a unified framework for the detection and correction of...