High-performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their state in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. These overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of a failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes, using knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of c...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
Computer clusters are today the reference architecture for high-performance co...
Next-generation exascale systems, those capable of performing a quintillion (10^18) operations ...
By leveraging enormous computational capabilities, scientists today are able to ...
To make progress in the face of failures, long-running parallel applications need to save their ...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
The HPC community projects that future extreme scale systems will be much less stable than curr...
As the capability and component count of systems increase, the MTBF decreases. Typically, a...
The advent of cluster computing has resulted in a thrust towards providing software mechanisms for r...
The efficient utilization of current supercomputing systems with deep storage hierarchies demands sc...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
Checkpointing has been widely adopted in support of fault-tolerance and job migration essential for ...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
Failures are increasingly threatening the efficiency of HPC systems, and curre...