Abstract. As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require a central storage for storing checkpoints. This severely limits the scalability of checkpointing. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, SAN-based solutions, and a commercial parallel file system, and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our replication schem...
A long-term trend in high-performance computing is the increasing number of no...
As High Performance platforms (Clusters, Grids, etc.) continue to grow in size...
Ultra-scale computer clusters with high speed interconnects, such as InfiniBand, are being widely de...
Abstract—As computational clusters increase in size, their mean-time-to-failure reduces drastically....
With the increased failure rate expected in future extreme scale supercomputer...