This paper describes the design, implementation, and evaluation of RENEW, a run-time system for clusters of workstations that allows rapid testing of checkpoint protocols with standard benchmarks. To achieve this goal, RENEW provides a flexible set of operations that lets a protocol be integrated into the system with little programming effort. To support a broad range of applications, RENEW exports the industry-endorsed Message Passing Interface (MPI) as its external interface. Three distinct classes of protocols were evaluated in the RENEW environment with SPEC and NAS benchmarks on a network of workstations connected by ATM. It was observed that the communication-induced protocol emulated the behavior of the coordinated proto...
To make progress in the face of failures, long-running parallel applications need to save their ...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
104 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 1998. A large number of checkpoint-...
The ever-increasing number of processors used in parallel computers is making fault tolerance suppor...
Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. ...
In this paper, we present a unified model for several well-known checkpoint/re...
Checkpoint is defined as a designated place in a program at which normal processing is in...
Checkpointing is a procedure of storing process state to a file, which is later used to re...
Nowadays, clusters are widely used to execute scientific applications. These applications...
This paper presents a new checkpointing algorithm for systems using reliable communication channels....
A long-term trend in high-performance computing is the increasing number of no...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
Checkpointing of parallel applications can be used as the core technology to provide process migrati...