Network failures remain one of the main causes of unreliability in distributed systems. To address this problem, we present an improvement to a failure prediction system that is based on Elastic Net Logistic Regression and rare-event prediction techniques and can work with sparse, high-dimensional datasets. Specifically, we prove its stability, fine-tune its hyperparameter, and improve its industrial utility by showing that, with a slight change in dataset creation, it can also predict the location of a failure, a key asset when taking a proactive approach to failure management.
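To make the modelling idea concrete, the following is a minimal sketch of elastic-net-penalized logistic regression on sparse rows, written in plain Python. It is not the system described above: the penalty parameterization `lam * (alpha * L1 + (1 - alpha) / 2 * L2)`, the proximal gradient solver, the inverse-frequency class weights (used here as a simple stand-in for rare-event techniques), and all names such as `fit_elastic_net_logreg` and `make_toy_dataset` are assumptions made for illustration, and the data is synthetic.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink a value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def predict_proba(row, weights, bias):
    """Failure probability for one sparse row {feature_index: value}."""
    return sigmoid(bias + sum(weights[j] * v for j, v in row.items()))


def fit_elastic_net_logreg(rows, labels, n_features,
                           lam=0.01, alpha=0.5, lr=0.5, epochs=300):
    """Proximal gradient descent on a class-weighted logistic loss.

    Assumed penalty: lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).
    The bias term is left unpenalized.
    """
    n = len(rows)
    n_pos = sum(labels)
    n_neg = n - n_pos
    # Assumption: inverse-frequency class weights so rare failures are not ignored.
    class_weight = {1: n / (2.0 * n_pos), 0: n / (2.0 * n_neg)}
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad = [0.0] * n_features
        grad_bias = 0.0
        for row, y in zip(rows, labels):
            err = (predict_proba(row, weights, bias) - y) * class_weight[y]
            for j, v in row.items():  # only touch non-zero features
                grad[j] += err * v
            grad_bias += err
        for j in range(n_features):
            # Gradient step on the smooth part (loss + L2), then L1 shrinkage.
            smooth = grad[j] / n + lam * (1.0 - alpha) * weights[j]
            weights[j] = soft_threshold(weights[j] - lr * smooth, lr * lam * alpha)
        bias -= lr * grad_bias / n
    return weights, bias


def make_toy_dataset(n_rows=200, n_features=20):
    """Synthetic data: feature 0 is a failure precursor, the rest is background."""
    rows, labels = [], []
    for i in range(n_rows):
        row = {1 + i % (n_features - 1): 1.0}  # one background event per row
        is_failure = int(i % 20 == 0)          # 5% of rows are failures
        if is_failure:
            row[0] = 1.0
        rows.append(row)
        labels.append(is_failure)
    return rows, labels


rows, labels = make_toy_dataset()
weights, bias = fit_elastic_net_logreg(rows, labels, n_features=20)
print("P(failure | precursor seen):", predict_proba({0: 1.0}, weights, bias))
print("P(failure | background only):", predict_proba({15: 1.0}, weights, bias))
```

The L1 part of the penalty pushes uninformative weights toward zero, which suits sparse, high-dimensional inputs, while the L2 part keeps the remaining weights bounded. In practice a library solver would likely replace this hand-written loop.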
Failure is an increasingly important issue in high performance computing and cloud systems. As large...
Cloud failure is one of the critical issues since it can cost millions of dollars to cloud service p...
As society becomes more dependent upon computer systems to perform increasingly critical tasks, ensu...
We focus on machine failure prediction in Industry 4.0. Indeed, it is used for classification problem...
Large-scale clusters are growing at a rapid pace, and the resulting amount of monitoring data produc...
This thesis investigates the possibility of enhancing an existing performance monitoring system for ...
Research in the field of failure log analysis shows that spatial and temporal patterns exist among e...
We consider the problem of predicting faults in deployed, large-scale distributed systems that are h...
With the prosperity of Big Data, the performance and robustness of storage systems have become ever ...
With the increasing complexity and scope of software systems, their dependability is crucial. The a...
Proactive failure management can yield substantial cost savings for companies. Machine Learning has proven ...