Parallel scientific applications cope with machine unreliability through periodic checkpointing, in which all processes coordinate to dump memory to stable storage simultaneously. However, in systems comprising tens of thousands of nodes, the total data volume can overwhelm the network and storage farm, creating an I/O bottleneck. Furthermore, a very large class of scientific applications fails entirely on these systems if even one of its processes dies. Poor checkpointing performance limits checkpointing frequency and increases applications' time-to-solution. And because large systems tend to fail often, applications spend more time in recovery and restart. Diskless checkpointing is a viable approach that provides high-performance and ...
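To make the diskless idea concrete: rather than writing to a parallel file system, each process keeps its checkpoint in memory and a peer group maintains an XOR parity block, so any single lost checkpoint can be rebuilt without touching disk. The sketch below is illustrative only, assuming the classic single-failure XOR scheme; NumPy arrays stand in for per-rank memory, and encode_parity/recover are hypothetical names.

```python
import numpy as np

def encode_parity(checkpoints):
    # XOR-fold the in-memory checkpoints of a peer group into one
    # parity block, typically held by a dedicated checkpoint process.
    parity = np.zeros_like(checkpoints[0])
    for ckpt in checkpoints:
        parity ^= ckpt
    return parity

def recover(surviving, parity):
    # A single lost checkpoint is the XOR of the parity block with
    # every surviving checkpoint.
    lost = parity.copy()
    for ckpt in surviving:
        lost ^= ckpt
    return lost

# Hypothetical 4-process group; each rank holds a 1 KiB local checkpoint.
rng = np.random.default_rng(0)
ranks = [rng.integers(0, 256, 1024, dtype=np.uint8) for _ in range(4)]
parity = encode_parity(ranks)

# Rank 2 fails: its checkpoint is rebuilt with no disk access at all.
restored = recover(ranks[:2] + ranks[3:], parity)
assert np.array_equal(restored, ranks[2])
```

The trade-off is that parity encoding consumes memory and network bandwidth on the compute nodes themselves, which is exactly how the approach sidesteps the storage-farm bottleneck described above.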
As the number of CPU cores in high-performance computing platforms continues to grow, the availabili...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
With increasing scale and complexity of supercomputing and cloud computing arc...
Ph.D. thesis, University of Illinois at Urbana-Champaign, 2005. As a technology projection, w...
As the number of processors in today’s parallel systems continues to grow, the mean-time-to-failure ...
By leveraging enormous computational capability, scientists today are able to ...
As we move to large manycores, the hardware-based global checkpointing schemes that have been propo...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
A checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
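A minimal illustration of that definition, as a sketch under stated assumptions (a serial program; the file name state.ckpt and the 100-step interval are arbitrary choices): the program interrupts normal processing at designated places, preserves its status to stable storage, and on restart resumes from the saved state instead of from the beginning.

```python
import os
import pickle

CKPT = "state.ckpt"  # illustrative checkpoint file

def save_checkpoint(state):
    # Write atomically: dump to a temp file, then rename, so a crash
    # mid-write never leaves a corrupt checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def load_checkpoint():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}

state = load_checkpoint()
for step in range(state["step"], 1000):
    state["total"] += step            # the "normal processing"
    state["step"] = step + 1
    if step % 100 == 0:               # a designated checkpoint place
        save_checkpoint(state)
```

Killing and rerunning the script resumes from the most recently checkpointed step rather than from step 0.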
This work provides an analysis of checkpointing strategies for minimizing expe...
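The truncated abstract does not show which strategy the analysis arrives at; the classic first-order answer to this minimization is Young's approximation, tau_opt = sqrt(2CM) for checkpoint cost C and mean time between failures M, later refined by Daly. A small sketch of that standard formula (not necessarily this work's result):

```python
from math import sqrt

def young_interval(ckpt_cost_s, mtbf_s):
    # Young's first-order optimal checkpoint interval:
    # tau_opt = sqrt(2 * C * M), with C the time to write one
    # checkpoint and M the platform mean time between failures.
    return sqrt(2.0 * ckpt_cost_s * mtbf_s)

# e.g. a 5-minute checkpoint on a platform with a 24-hour MTBF
print(young_interval(300, 24 * 3600) / 3600, "hours")  # 2.0 hours
```

Checkpointing more often than tau_opt wastes time writing checkpoints; checkpointing less often wastes time recomputing lost work after a failure.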
Fast checkpointing algorithms require distributed access to stable storage. Th...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
Checkpointing and rollback recovery is a very effective technique to tolerate the occurrence of fail...