This paper presents a scalable, adaptive, and time-bounded general approach to assure reliable, real-time Node-Failure Detection (NFD) for large-scale, high-load networks composed of Commercial Off-The-Shelf (COTS) hardware and software. Nodes in the network are independent processors that may unpredictably fail either temporarily or permanently. We present a generalizable, multi-layer, dynamically adaptive monitoring approach to NFD in which information about node failures is communicated to a small, designated subset of the nodes. This subset of nodes is notified of node failures in the network within an interval of time after the failures. Except under conditions of massive system failure, the NFD system has a zero false negative rate (...
Building an infrastructure for Exascale applications requires, in addition to ...
Distributed systems form an integral part of human life—from ATMs to the Domain Name Service. Typica...
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6663519&isnumber=6663488 ...
In this paper, we propose an algorithm to efficiently diagnose large-scale clustered failu...
p. 1507-1524. The distributed computing scenario is rapidly evolving ...
It is the age of information technology. Around the world, millions of computers are being linked t...
Failure detection is at the core of most fault tolerance strategies, but it often depends on reliabl...
Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomp...
Failure detection is a fundamental building block for ensuring fault tolerance in distributed system...
Failure detectors are fundamental building blocks in distributed systems. Multi-node failure detect...
The distributed computing scenario is rapidly evolving for integrating self-or...
Failure detectors are fundamental building blocks in distributed systems. Multi-node failure detect...
One of the key reasons overlay networks are seen as an excellent platform for large scale distribute...
Propagation of faulty data is a critical issue. In case of Delay Tolerant Netw...