Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. The large component count means that these systems fail frequently and often in very complex ways, making them difficult to use and maintain. While prior work on fault detection and diagnosis has focused on faults that significantly reduce system functionality, the wide variety of failure modes in modern systems makes them likely to fail in complex ways that impair system performance but are difficult to detect and diagnose. This paper presents AutomaDeD, a statistical tool that models the timing behavior of each application task and tracks its behavior to identify any abnormalities. If any are observed, AutomaDeD can immediately det...
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Com...
Distributed systems form an integral part of human life—from ATMs to the Domain Name Service. Typica...
It is a great challenge to build reliable computer systems with unreliable hardware and buggy softwa...
Enterprise and high-performance computing systems are growing extremely large and complex, employing...
Short overview: Both Grid middleware services and applications face failures, and the more widely de...
As today's distributed applications increase in complexity, it becomes increasingly difficult to ...
One of the important design criteria for distributed systems and their applications is their reliabi...
Detection, diagnosis and mitigation of performance problems in today's large-scale distributed an...
A popular approach for producing parallel software is to develop a sequential version of an applica...
Developing correct and efficient software for large scale systems is a challenging task. Developers ...
When working with distributed systems, detecting faults can be a difficult task, as abnormalities is...
Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose fai...
Modern high-end computers are unprecedentedly complex. Occurrence of faults is an inevitable fact in...
Distributed software systems have become the backbone of Internet services. Failures in production ...