Network failures remain one of the main causes of unreliability in distributed systems. To address this problem, we present an improvement to a failure prediction system based on Elastic Net Logistic Regression and rare-event prediction techniques, which can work with sparse, high-dimensional datasets. Specifically, we prove its stability, fine-tune its hyperparameter, and improve its industrial utility by showing that, with a slight change in dataset creation, it can also predict the location of a failure, a key asset when taking a proactive approach to failure management.
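To make the modelling idea concrete, the following is a minimal sketch of elastic-net-penalised logistic regression on sparse inputs, fitted by proximal gradient descent. It is not the authors' implementation. The function names, the sparse-row representation (dictionaries mapping feature index to value), the default values of `lam`, `alpha`, `step` and `epochs`, and the use of inverse-frequency class weights as a stand-in for the rare-event techniques are all assumptions made for illustration.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(row, weights, bias):
    """Failure probability for one sparse row {feature_index: value}."""
    return sigmoid(bias + sum(weights[j] * v for j, v in row.items()))


def fit_elastic_net_logreg(rows, labels, n_features,
                           lam=0.01, alpha=0.5, step=0.5, epochs=2000):
    """Fit weights and bias by proximal gradient descent.

    Penalty: lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).
    The bias is left unpenalised.
    """
    n = len(rows)
    n_pos = sum(labels)
    if n_pos == 0 or n_pos == n:
        raise ValueError("both classes must be present")
    # Assumption: inverse-frequency class weights, so that the rare
    # failure class carries the same total weight as the normal class.
    w_pos = n / (2.0 * n_pos)
    w_neg = n / (2.0 * (n - n_pos))

    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad = [0.0] * n_features
        grad_b = 0.0
        for row, y in zip(rows, labels):
            p = predict_proba(row, weights, bias)
            err = (p - y) * (w_pos if y == 1 else w_neg) / n
            grad_b += err
            # Only the non-zero entries of a sparse row contribute.
            for j, v in row.items():
                grad[j] += err * v
        bias -= step * grad_b
        for j in range(n_features):
            # Gradient step on the smooth part (loss + ridge term),
            # followed by the L1 proximal step (soft-thresholding).
            u = weights[j] - step * (grad[j] + lam * (1.0 - alpha) * weights[j])
            weights[j] = soft_threshold(u, step * lam * alpha)
    return weights, bias
```

Because the L1 part of the penalty is applied through soft-thresholding, a feature that never appears in the data keeps a weight of exactly zero. That behaviour is what makes this family of models attractive for sparse, high-dimensional inputs. The fixed step size is chosen for small binary-feature examples and would need tuning, or a line search, on real data.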
Large-scale data center networks are complex - comprising several thousand network devices and sever...
Failure is an increasingly important issue in high performance computing and cloud systems. As large...
We focus on machine failure prediction in Industry 4.0. Indeed, it is used for classification problem...
Large-scale clusters are growing at a rapid pace, and the resulting amount of monitoring data produc...
Research in the field of failure log analysis shows that spatial and temporal patterns exist among e...
We consider the problem of predicting faults in deployed, large-scale distributed systems that are h...
Proactive failure management can produce significant cost savings for companies. Machine Learning has proven ...
This thesis investigates the possibility of enhancing an existing performance monitoring system for ...
With the prosperity of Big Data, the performance and robustness of storage systems have become ever ...
Distributed computing environments are increasingly deployed over geographically spanning data cente...
As society becomes more dependent upon computer systems to perform increasingly critical tasks, ensu...
With the increasing complexity and scope of software systems, their dependability is crucial. The a...