Commodity computer clusters are often composed of hundreds of computing nodes. These systems are generally built from off-the-shelf components and are not designed for high reliability, so node failures drive the mean time between failures (MTBF) of such clusters to unacceptable levels. The software frameworks used for running parallel applications therefore need to be fault-tolerant to ensure continued execution despite node failures. We propose an extension to the flow-graph-based Dynamic Parallel Schedules (DPS) development framework that allows non-trivial parallel applications to continue executing despite node failures. The proposed fault-tolerance mechanism relies on a set of backup threads located in the volatile storage of alternate nodes. These backup threads are kept up to date...
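To make the MTBF argument concrete, here is a minimal sketch of how cluster-level MTBF shrinks with node count. It rests on assumptions the abstract does not state: node failures are independent and exponentially distributed, and any single node failure interrupts the application. The function names and the figures used (10,000-hour node MTBF, 500 nodes, 24-hour job) are hypothetical and chosen only for illustration.

```python
import math


def cluster_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Time to first node failure across the cluster.

    Assumption (not from the source): independent, exponentially
    distributed node failures, so failure rates simply add up.
    """
    if node_mtbf_hours <= 0 or node_count <= 0:
        raise ValueError("node_mtbf_hours and node_count must be positive")
    return node_mtbf_hours / node_count


def survival_probability(job_hours: float, node_mtbf_hours: float,
                         node_count: int) -> float:
    """Probability that a job of the given length sees no node failure."""
    if job_hours < 0:
        raise ValueError("job_hours must be non-negative")
    mtbf = cluster_mtbf_hours(node_mtbf_hours, node_count)
    return math.exp(-job_hours / mtbf)


if __name__ == "__main__":
    # Hypothetical figures: 500 nodes, each with a 10,000-hour MTBF.
    print(cluster_mtbf_hours(10_000, 500))        # 20.0 hours
    print(survival_probability(24, 10_000, 500))  # roughly 0.30
```

Under these assumptions, a cluster of individually reliable nodes fails about once every 20 hours, and a 24-hour run without fault tolerance completes uninterrupted less than a third of the time.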
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
Wide-area parallel processing systems will soon be available to researchers to solve a range of prob...
InteGrade is a grid middleware infrastructure that enables the use of idle computing power from user...
Dynamic parallel schedules (DPS) is a flow graph based framework for developing parallel application...
Dynamic Parallel Schedules (DPS) is a flow graph based framework for developing parallel a...
Flow graphs provide an explicit description of the parallelization of an application by mapping vert...
Today, many distributed systems are deployed in high-performance computing environments such as a mu...
The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can...
As the scale of parallel systems continues to grow, fault management of these systems is be...
High performance computing applications must be resilient to faults. The tradi...
This dissertation describes the design, implementation, and performance of two mechanisms that addre...
This paper addresses the question as to whether there is potential gain to be made from executing su...
This paper presents a solution for the problem of transparent recovery of asynchronous dis...
ISBN 978-1-4577-1052-0. In this paper, a Self-Recovering strategy, which is able...
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and sile...