The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can no longer be ignored. No matter how reliable the individual components may be, the complexity of these systems results in a significant probability of failure during lengthy computations. In the case of distributed memory multiprocessors, fault tolerance techniques developed for distributed operating systems and applications can also be applied to parallel computations. In this paper we survey some of the principal paradigms for fault-tolerant distributed computing and discuss their relevance to parallel processing. One particular technique, passive replication, is explored in detail as it forms the basis for fault tolerance in the ...
This thesis focuses on the issue of reliability and fault tolerance in Distributed Shared Memory Mul...
Processor failures in post-petascale parallel computing platforms are common o...
Fault tolerance in distributed computing is a wide area with a significant body of literature that i...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
This research proposes an algorithm for fault-tolerance in a home-based lazy release consistent dist...
Fault tolerance can be defined as a concept of recovery that keeps a computer system operational by ...
This paper addresses the question as to whether there is potential gain to be made from executing su...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to di...
Proceedings of the International Conference on Information, Communications and Signal Processing, IC...
As people are becoming increasingly dependent on computerized systems, the need for these systems to...
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and sile...
This book covers the most essential techniques for designing and building dependable distributed sys...