<p>Chip manufacturers and hyperscalers are increasingly aware of the problem posed by Silent Data Errors (SDEs) and are taking steps to address it. Major computing facility operators such as Meta and Google have emphasized how critical SDEs are in today’s microprocessors. Numerous studies have highlighted the severity of this issue, especially in datacenter applications operating at large scale. These errors can lead to data loss, and debugging them can take months of engineering effort. In this paper, we provide an overview of SDEs, including an explanation of the problem and the current methods used to addre...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems ...
Technology and voltage scaling is making integrated circuits increasingly susceptible to failures ca...
<p>Today more than ever before, academia, manufacturers, and hyperscalers acknowledge the majo...
According to Moore’s law, technology scaling is continuously providing smaller and faster devices. T...
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems ...
<p>Computing systems use dynamic random-access memory (DRAM) as main memory. As prior works have sho...
Silent Data Corruption (SDC) is a serious reliability issue in many domains, including embedded syst...
As machines increase in scale, it is predicted that failure rates of supercomputers will correspondi...
According to Moore’s law, technology scaling is continuously providing smaller and faster devices. T...
Hardware errors are on the rise with reducing chip sizes, and power constraints have necessitated th...
ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors. (Under the direc...
Over three decades of continuous scaling in CMOS technology has led to tremendous improvements in pr...
GPU (Graphics Processing Unit) is emerging as a key 3D/2D graphics and parallel workload accelerator...
As we move deep into the nanometer regime of CMOS VLSI (45nm node and below), the device noise margin ge...