As the size of high performance clusters multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are effective approaches for dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restar...
Ultra-scale computer clusters with high speed interconnects, such as InfiniBand, are being widely d...
Because of increasing hardware and software complexity, the running time of many computational scie...
Nowadays, clusters are widely used to execute scientific applications. These applications are often ...
As the size of supercomputers increases, the probability of system failure grows substantially, posi...
A long-term trend in high-performance computing is the increasing number of no...
Ultra-scale computer clusters with high speed interconnects, such as InfiniBand, are being widely de...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
Checkpointing is a typical approach to tolerate failures in today’s supercomputing cluste...
As computational clusters increase in size, their mean-time-to-failure reduces drastically....
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
As computational clusters increase in size, their mean-time-to-failure reduces. Typically ...
Most predictions of Exascale machines picture billion way parallelism, encompassing not on...
Nowadays, clusters are widely used to execute scientific applications. These applications...
The running times of many computational science applications are much longer than the mean-time-to-f...