As machines grow in scale, the failure rates of supercomputers are predicted to increase correspondingly. Even though the mean time to failure (MTTF) of an individual component is high, the large number of components significantly decreases the system MTTF. Meanwhile, the decreasing size of transistors has been critical to the increase in capacity of supercomputers. The smaller the transistors are, the more frequently silent data corruptions (SDCs) are likely to occur. SDCs do not inhibit execution, but may silently lead to incorrect results. In this thesis, we leverage runtime system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of various system-level fault toleran...
Unpredictable hardware faults and software bugs lead to application crashes, incorrect computations,...
To meet an insatiable consumer demand for greater performance at less power, silicon technology has ...
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus o...
The current approach to resilience for large high-performance computing (HPC) machines is based on g...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
According to Moore’s law, technology scaling is continuously providing smaller and faster devices. T...
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in...
Transient hardware faults have become one of the major concerns affecting the reliability of modern ...
In this dissertation we address the overhead reduction of fault tolerance (FT) techniques. Due to te...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
In this paper, we combine the traditional checkpointing and rollback recovery ...