International audience. In this paper a new checkpoint/recovery protocol called theft-induced checkpointing is defined for dataflow computations in large heterogeneous environments. The protocol is especially useful in massively parallel multi-threaded computations as found in cluster or grid computing and utilizes the principle of work-stealing to distribute work. By basing the state of executions on a macro dataflow graph, the protocol shows extreme flexibility with respect to rollback. Specifically, it allows local rollback in dynamic heterogeneous systems, even under a different number of processors and processes. To maximize run-time efficiency, the overhead associated with checkpointing is shifted to the rollback operations whenever poss...
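The idea of tying checkpoints to theft events can be illustrated with a minimal sketch. It is not the protocol defined in the paper: it assumes that a checkpoint is simply a copy of the stolen task's description written to stable storage at the moment of the theft, and that rollback re-creates that task on a replacement worker. The names `Task`, `Worker`, `steal`, `recover`, and `stable_storage` are all hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """A node of a toy dataflow graph: a name and the inputs it consumes."""
    name: str
    inputs: tuple = ()


@dataclass
class Worker:
    """A hypothetical worker holding a double-ended queue of ready tasks."""
    name: str
    ready: deque = field(default_factory=deque)


def steal(thief, victim, stable_storage):
    """Move one task from victim to thief and checkpoint it at theft time."""
    if not victim.ready:
        return None                        # nothing to steal
    task = victim.ready.popleft()          # the thief takes the oldest task
    stable_storage[thief.name] = task      # assumed checkpoint: the stolen task
    thief.ready.append(task)
    return task


def recover(failed_name, replacement, stable_storage):
    """Re-create the failed worker's stolen task on a replacement worker."""
    task = stable_storage.get(failed_name)
    if task is not None:
        replacement.ready.append(task)     # local rollback: only this task restarts
    return task
```

In this sketch the cost during normal execution is one write per theft, and the work of rebuilding state is deferred to `recover`, which loosely mirrors the abstract's remark that checkpointing overhead is shifted to the rollback operations. Because the checkpoint holds a task description rather than a process image, the replacement worker does not need to match the failed one.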
To make progress in the face of failures, long-running parallel applications need to save their ...
International audience. Fault tolerance protocols play an important role in today's long runtime scienti...
Real-world graph processing applications often require combining the graph data with tabular data. M...
International audience. In this paper a new checkpoint/recovery protocol called theft-induced checkpoin...
International audience. This paper presents a new checkpoint/recovery method for dataflow computations...
Abstract. This paper presents a new checkpoint/recovery method for dataflow computations using work-...
Grid and cluster architectures are gaining in popularity for scientific computing applications. The ...
Fault-tolerance protocols play an important role in today's long runtime scientific parallel appli...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
Abstract. Scientific workflow systems often operate in unreliable environments, and have accordingly i...
Scientific workflow systems often operate in unreliable environments, and have accordingly incorpora...
Checkpointing and rollback recovery are techniques that can provide efficient recovery from transien...
InteGrade is a grid middleware infrastructure that enables the use of idle computing power from user...
International audience. Fault-tolerance protocols play an important role in today's long runtime scienti...