Full system reliability is a problem that spans multiple levels of the software/hardware stack. The normal execution of a program in a system can be disrupted by multiple factors, ranging from transient errors in a processor and software bugs, to permanent hardware failures and human mistakes. A common method for recovering from such errors is the creation of checkpoints during the execution of the program, allowing the system to restore the program to a previous error-free state and resume execution. Different causes of errors, though, have different occurrence frequencies and detection latencies, requiring the creation of multiple checkpoints at different frequencies in order to maximize the availability of the system. In this ...
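The idea of keeping several checkpoints taken at different frequencies can be illustrated with a small sketch. It is not the method of the paper, only a toy model under stated assumptions: the names `CheckpointLevel`, `run`, and `latest_safe_checkpoint` are hypothetical, program state is a single integer, and an error is assumed to corrupt every checkpoint saved at or after the step where it occurred.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointLevel:
    """One checkpoint level (hypothetical model): saves every `interval` steps."""
    name: str
    interval: int                 # steps between checkpoints at this level
    keep: int = 2                 # how many recent checkpoints to retain
    saved: list = field(default_factory=list)  # (step, state), oldest first

    def maybe_save(self, step, state):
        if step % self.interval == 0:
            self.saved.append((step, state))
            # Drop everything except the `keep` most recent checkpoints.
            del self.saved[:-self.keep]


def run(total_steps, levels):
    """Toy program: the state is the running sum of the step numbers."""
    state = 0
    for step in range(1, total_steps + 1):
        state += step
        for level in levels:
            level.maybe_save(step, state)
    return state


def latest_safe_checkpoint(levels, error_step):
    """Newest (step, state, level name) saved strictly before the error occurred.

    Returns None when every retained checkpoint is already contaminated.
    """
    candidates = [
        (step, state, level.name)
        for level in levels
        for (step, state) in level.saved
        if step < error_step
    ]
    return max(candidates, key=lambda c: c[0]) if candidates else None


if __name__ == "__main__":
    levels = [CheckpointLevel("fast", interval=5), CheckpointLevel("slow", interval=20)]
    run(50, levels)
    # Error occurred at step 42 but was only detected at step 50: both retained
    # "fast" checkpoints (steps 45 and 50) are contaminated, so recovery falls
    # back to the "slow" level.
    print(latest_safe_checkpoint(levels, error_step=42))
    # Error occurred at step 48: the "fast" checkpoint at step 45 is still clean.
    print(latest_safe_checkpoint(levels, error_step=48))
```

In this sketch the frequent level limits lost work for errors that are detected quickly, while the infrequent level covers errors with long detection latency, at the cost of a longer rollback.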
With increasing scale and complexity of supercomputing and cloud computing arc...
Operating systems and hypervisors enable the collection and extraction of rich information on applic...
In checkpointing schemes with task duplication, checkpointing serves two purposes: detecting faults ...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
The problems of software debugging and system reliability/availability are among the most challengin...
The execution times of large-scale parallel applications on modern multi/many-core systems a...
Failures are increasingly threatening the efficiency of HPC systems, and curre...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
Memory system design is important for providing high reliability and availability. This dissertation...
As the number of CPU cores in high-performance computing platforms continues to grow, the availabili...
The coming exascale era is a great opportunity for high performance computing (HPC) applications. Ho...
Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the red...
For implementing fault-tolerance in multicomputer systems, backward error recovery, based on checkpo...
A checkpoint is defined as a designated place in a program at which normal processing is interrupted s...