Short overview: Both Grid middleware services and applications face failures, and the more widely deployed they are, the higher the price of not detecting failures early (lost jobs, wasted resources ...). Automated detection, diagnosis, and ultimately management of software/hardware problems define autonomic dependability. This work reports on a generic mechanism for autonomic detection of EGEE failures involving abrupt changes in the behaviour of quantities of interest, and on some applications. Analysis: The complexity of the hardware/software components, and the intricacy of their interactions, defeat attempts to build fault models from a priori knowledge alone. A black-box approach, where we observe the events to spot outliers, i...
Abstract: Failure detection and group membership management are basic building blocks for self-rep...
Selected for publication in the post-conference book. Computing grids are large-scale, highly-distribu...
Large software systems are extremely complex and based on code that is constantly changing with bug ...
Today's largest systems have over 100,000 cores, with million-core systems expected over the next fe...
The emergence of Grid infrastructures like EGEE has enabled the deployment of large-scale computatio...
Production run software failures cause endless grief to end-users, and endless challenges to program...
Web applications suffer from software and configuration faults that lower their availability. Recove...
Failure detection is a basic service for building dependable systems. The large-scale distribution o...
One of the important design criteria for distributed systems and their applications is their reliabi...
Typically, emerging system failures have a strong impact on the performance of industrial systems as...
Distributed software systems have become the backbone of Internet services. Failures in production ...
In many industrial processes, faults are liable to occur and can sometimes have dramatic and/or...
Distributed computing environments are increasingly deployed over geographically spanning data cente...
Thanks to the Grid, users have access to computing resources distributed all over the world. The Gri...