In this paper, we study real-time in-memory checkpointing as an effective means to improve the reliability of future large-scale parallel processing systems. In this context, the checkpoint overhead can become a significant performance bottleneck. Novel memory system designs with upcoming non-volatile random access memory (NVRAM) technologies are emerging to address this performance issue. However, we find that those designs can still have prohibitively high checkpoint overhead and system downtime, especially when checkpoints are taken frequently to implement a reliable system. We propose a novel in-memory checkpointing system, named Mona, for reducing the checkpoint overhead of hybrid memory systems with NVRAM and DRAM...
Transparent hypervisor-level checkpoint-restart mechanisms for virtual clusters (VCs) or clusters of...
Over the last decade, computing systems have turned to large-scale parallel platforms composed of thousand...
As the capability and component count of systems increase, the MTBF decreases. Typically, a...
Checkpointing is a pivotal technique in system research, with applications ranging from crash recove...
The increasing size of computational clusters results in an increasing probability of failur...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
Checkpointing schemes enable fault-tolerant parallel and distributed computing by leveraging the red...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
A Parallel Single Level Store (PSLS) system integrates a shared virtual memory and a parallel file ...
This study explores a recovery strategy using checkpointing in a distributed shared virtual memory (...
High-frequency memory checkpointing is an important technique in several application domains, such a...
To make progress in the face of failures, long-running parallel applications need to save their ...
A new transparent, incremental, concurrent checkpoint mechanism for real-time and interactive applic...
Memory system design is important for providing high reliability and availability. This dissertation...