Abstract. This paper presents a new checkpoint/recovery method for dataflow computations using work-stealing in heterogeneous environments such as those found in grid or cluster computing. By basing the state of the computation on a dynamic macro dataflow graph, it is shown that the mechanisms provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods, Systematic Event Logging and Theft-Induced Checkpointing, are presented that are efficient and extremely flexible under the system-state model, allowing for recovery on different platforms with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation in a large cluster. I...
Scientific workflows are data- and compute-intensive; thus, they may run for days or even weeks...
In this paper, we describe an efficient coordinated-checkpointing and recovery algorithm which can w...
Distributed computing infrastructures are commonly used through scientific gat...
This paper presents a new checkpoint/recovery method for dataflow computations...
In this paper a new checkpoint/recovery protocol called theft-induced checkpoin...
Grid and cluster architectures are gaining in popularity for scientific computing applications. The ...
Real-world graph processing applications often require combining the graph data with tabular data. M...
Large-scale graph and machine learning analytics widely employ distributed iterative processing. Typ...
Scientific workflow systems often operate in unreliable environments, and have accordingly incorpora...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
Current approaches for checkpointing and recovery assume system homogeneity, where checkpointing and...
The ever-increasing number of computation units assembled in current HPC platforms leads to a concer...
To make progress in the face of failures, long-running parallel applications need to save their ...
With the advent of exascale computing, issues such as application irregularity and permanent hardwar...