For constructing fault tolerance mechanisms in large, massively parallel multiprocessor systems, a scalable fault diagnosis is necessary, one that works efficiently even when the system contains several thousand processors. In this paper we present an event-driven, distributed system-level diagnosis algorithm based on a general diagnosis model that does not limit the number of simultaneously existing faults. In particular, the relation between error detection and fault localization, as well as two different methods for distributing diagnostic information, are examined in detail. Furthermore, we give measurements of how our diagnosis algorithm affects application performance.
In massively parallel systems (MPS), fault tolerance is indispensable to obtain proper completion o...
We consider issues of fault tolerance for distributed computing systems at two levels of system desi...
We develop a widely applicable algorithm to solve the fault diagnosis problem in certain distributed...
This dissertation addresses the distributed self-diagnosis of multiprocessor/multicomputer systems b...
Massively parallel multiprocessors induce new requirements for system-level fault diagnosis, like ha...
We consider problems of fault diagnosis in multiprocessor systems. Preparata, Metze and Chie...
We consider problems of fault diagnosis in multiprocessor systems. Preparata, Metze and Chien (1967)...
We consider the problem of fault diagnosis in multiprocessor systems. Every processor can te...
The distributed self-diagnosis of a multiprocessor/multicomputer system based on interprocessor test...
Constraint-based diagnosis algorithms for multiprocessors. A. Petri, P. Urban, J. Altmann, M. Dal Cin...
In recent years, new ideas have appeared in system-level diagnosis of multiprocessor systems. In cont...
Probabilistic diagnosis aims at making the system-level fault diagnostic problem both easier to solv...
A fault diagnosis model for multiprocessor computers is proposed. Under normal operating mod...
A new method for local diagnosis in regularly interconnected massively parallel systems with fault-c...