Grid and cluster architectures are gaining in popularity for scientific computing applications. The distributed computations, as well as their underlying infrastructure consisting of a large number of computers, storage and networking devices, pose challenges in overcoming the effects of node failures. This work presents a new checkpoint/recovery method for dataflow computations using work-stealing in heterogeneous environments as found in grid or cluster computing. Basing the state of the computation on a dynamic macro dataflow graph, it is shown that the mechanisms provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented, i.e. Systematic Event Logging (SEL) and Theft-Induced C...
By leveraging the enormous amount of computational capabilities, scientists today are able to ...
This thesis focuses on a major problem for the HPC community: resilience. Computing platforms are bi...
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, a...
This paper presents a new checkpoint/recovery method for dataflow computations...
Abstract. This paper presents a new checkpoint/recovery method for dataflow computations using work-...
In this paper a new checkpoint/recovery protocol called theft-induced checkpoin...
This work deals with scheduling and checkpointing strategies to execute scientific workflows on fail...
The construction of grid computing is one of the major research topics in networked computer systems. The ...
The ever-increasing number of computation units assembled in current HPC platf...
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault ...
Fault-tolerance protocols play an important role in today's long-runtime scientific parallel appli...
Real-world graph processing applications often require combining the graph data with tabular data. M...
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience technique...