This paper presents a new transparent, incremental, concurrent checkpoint mechanism for embedded multi-core systems. It allows the checkpointed process (also called the checkpointee) to keep running, to a large extent, while checkpoints are being set. The mechanism traces TLB misses to block the first access to each target memory page while memory pages are being dumped (the most time-consuming step when setting a checkpoint). At that time, a kernel thread, called the checkpointer, copies the pages targeted by the blocked accesses to a designated memory buffer in order to construct a consistent state of the checkpointee, and then resumes the memory accesses. From the experimental results, in contrast to a traditional concurrent checkpoint system, the proposed me...
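The copy-on-first-access idea in this abstract can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not the paper's kernel implementation: the class `ConcurrentCheckpoint` and its methods are hypothetical names, pages are modelled as `bytearray` objects, an explicit `write` call stands in for the access trapped through a TLB miss, and only writes are intercepted (the abstract describes blocking accesses in general).

```python
class ConcurrentCheckpoint:
    """Toy model of copy-on-first-access checkpointing (hypothetical API)."""

    def __init__(self, memory):
        self.memory = memory      # live pages of the checkpointee
        self.snapshot = {}        # page index -> contents saved at checkpoint time
        self.pending = set()      # pages whose old contents are not saved yet

    def begin(self):
        # Start a checkpoint: every page is pending. An incremental scheme
        # would mark only the pages modified since the previous checkpoint.
        self.snapshot = {}
        self.pending = set(range(len(self.memory)))

    def _save(self, page):
        # Copy a page into the checkpoint buffer at most once per checkpoint.
        if page in self.pending:
            self.snapshot[page] = bytes(self.memory[page])
            self.pending.discard(page)

    def write(self, page, data):
        # Stand-in for the trapped first access: save the old contents,
        # then let the checkpointee's write proceed.
        self._save(page)
        self.memory[page] = bytearray(data)

    def dump_step(self):
        # Background work of the checkpointer: save one pending page.
        # Returns True once the checkpoint is complete.
        if self.pending:
            self._save(min(self.pending))
        return not self.pending


def demo():
    memory = [bytearray(b"a"), bytearray(b"b"), bytearray(b"c")]
    ckpt = ConcurrentCheckpoint(memory)
    ckpt.begin()
    ckpt.write(2, b"z")           # checkpointee keeps running during the dump
    while not ckpt.dump_step():   # checkpointer drains the remaining pages
        pass
    return ckpt.snapshot, memory
```

In this model the snapshot always reflects the memory contents at the moment `begin` was called, even though the checkpointee modifies page 2 before the dump finishes. That is the consistency property the abstract attributes to the checkpointer thread.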
Checkpointing enables us to reduce the time to recover from a fault by saving intermediate states of...
Checkpointing is widely used in robust fault-tolerant applications. We present an efficient incremen...
To make progress in the face of failures, long-running parallel applications need to save their ...
A new transparent, incremental, concurrent checkpoint mechanism for real-time and interactive applic...
We describe the software architecture, technical features, and performance of TICK (Transparent Inc...
Traditional checkpoint and recovery are based upon two basic assumptions. The first is the need to h...
There are two approaches to reduce the overhead associated with coordinated checkpointing: f...
This article proposes an original approach that applies the Rollback-Dependency Trackability (RDT) p...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
Checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the red...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
In checkpointing schemes with task duplication, checkpointing serves two purposes: detecting faults ...
The execution times of large-scale parallel applications on modern multi/many-core systems a...