Abstract—As the capability and component count of systems increase, the mean time between failures (MTBF) decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have orders of magnitude larger memory sizes and failure rates. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve ef...
Next-generation exascale systems, those capable of performing a quintillion (10^18) operations ...
The next generation of capability-class, massively parallel processing (MPP) systems is expected to ...
155 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2005. As a technology projection, w...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
A long-term trend in high-performance computing is the increasing number of no...
Abstract—As computational clusters increase in size, their mean-time-to-failure reduces drastically....
Abstract—The HPC community projects that future extreme scale systems will be much less stable than curr...
A Parallel Single Level Store (PSLS) system integrates a shared virtual memory and a parallel file ...
By leveraging enormous computational capabilities, scientists today are able to ...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
Abstract. As computational clusters increase in size, their mean-time-to-failure reduces. Typically ...
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, app...
As the size of high performance clusters multiplies, the probability of system failure grows substa...
Communicated by Akihiro Fujiwara. Fast checkpointing algorithms require distributed access to stable ...