Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and in the number of memory devices in Exascale systems. For memory systems, Error Correcting Codes (ECC) are the most commonly used mechanism. However, state-of-the-art hardware ECCs will not provide sufficient error coverage for future computing systems, and stronger hardware ECCs that provide more coverage have prohibitive costs in terms of area, power, and latency. Software-based solutions are therefore needed to cooperate with hardware. In this work, we propose a software mechanism based on Cyclic Redundancy Checks (CRCs) for task-parallel HPC applications. Our mechanism incurs o...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
Post-silicon healing techniques that rely on built-in redundancy (e.g. row/column redundanc...
Proposed exascale systems will present a number of considerable resiliency challenges. In...
Servers and HPC systems often use a strong memory error correction code, or ECC, to meet their relia...
As high performance computing (HPC) systems continue to grow, their fault rate increases. Applicatio...
In this work we propose partial task replication and check-pointing for task-parallel HPC applicatio...
Cyclic Redundancy Checks (CRC) constitute an important class of hash functions for detecting changes...
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Perfo...
Future computing platforms will increasingly demand more stringent memory resiliency mechanisms ...
Memory reliability has been a major design constraint for mission-critical and large-scale systems f...
Today’s HPC systems use two mechanisms to address main-memory errors. Error-correcting cod...
The coming exascale era is a great opportunity for high performance computing (HPC) applications. Ho...
Cyclic redundancy check (CRC) is widely used for error detection. For optimal performance, a method ...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...