The reliability of future general-purpose processors (GPPs) is threatened by a combination of factors such as shrinking transistor sizes, higher clock rates, and reduced supply voltages. Soft errors are predicted to become dramatically more frequent as these trends continue. Processors must therefore be protected from the effects of transient faults, with a minimal and predictable impact on performance. Such a predictable impact would allow the processor to be employed in real-time applications. In this thesis, we propose a superscalar architecture based on a checkpointing buddy cache architecture. This architecture keeps the execution of tasks free from soft errors by detecting the fault in the execution state and resetting the s...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
Errors have become a critical problem for high performance computing. Checkpoi...
Resilience has become a critical problem for high performance computing. Checkpointing protocols are...
As we move to large manycores, the hardware-based global checkpointing schemes that have been propo...
As high computing power is available at an affordable cost, we rely on microprocessor-based systems ...
The importance of fault tolerance at the processor architecture level has been made increasingly imp...
Checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the red...
As device density grows, each transistor gets smaller and more fragile leading to an overall higher ...
In checkpointing schemes with task duplication, checkpointing serves two purposes: detecting faults ...
In this paper, we describe new protocols augmenting traditional cache coherency mechanisms to implem...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
Checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...