In a large-scale real-time distributed system, the large number of components and the time criticality of tasks can give rise to complex situations. Providing predictable and reliable service is of paramount interest in such a system. For example, a single-point failure in an electric grid may lead to a widespread power outage such as the Northeast Blackout of 2003. System design and implementation address fault avoidance and mitigation. However, not all faults and failures can be removed during these phases, so run-time fault avoidance and mitigation are needed during operation. Timing constraints and the predictability of system behavior are also important concerns in a large-scale system. This dissertation proposes se...
Large-scale decentralized systems of autonomous agents interacting via asynchronous communication of...
The electric power grid is a complex, interconnected cyber-physical system comprised of collaboratin...
It is expected that future power systems will require radical distributed control approaches to acco...
We consider the problem of predicting faults in deployed, large-scale distributed systems that are h...
The paper proposes a methodology to effectively address the increasingly important problem of distri...
Networked systems present some key new challenges in the development of fault diagnosis architecture...
Fault-tolerance in distributed computing systems has been investigated extensively in the literature...
Distributed systems and extreme-scale systems are ubiquitous in recent years and have seen throughou...
Networked systems present some key new challenges in the development of fault diagnosis architecture...
A general framework for the design and analysis of distributed fault-tolerant systems is proposed in...
Fault tolerance is an important consideration when developing a reliable distributed system. Reac...
Failure detection is at the core of most fault tolerance strategies, but it often depends on reliabl...
Designing a distributed fault tolerance algorithm requires careful analysis of both fault models and...
This thesis addresses issues in building fault-tolerant distributed real-time systems. Such systems ...
The paper proposes a methodology to effectively address the increasingly important problem of distri...