Large-scale clusters are growing at a rapid pace, and the amount of monitoring data these systems produce is growing with them. The goal of this research is to investigate tools that use this wealth of data to improve the reliability of such systems and help manage them. This is a challenging problem: as these machines scale, their complexity, the volume of monitored data, and the number of interactions between nodes all increase, making the system much harder to manage and raising the failure frequency. In this thesis we focus on online failure prediction and policy-based management as mechanisms that can help address these issues. First, in the case of failure prediction, we focus on achieving an acceptable accuracy that is...
Monitoring the health of large data centers is a major concern with the ever-increasing demand of gr...
Failures of cluster systems have proven to have adverse effects and can be costly. System administr...
Prognostic Health Management (PHM) is a maintenance policy aimed at predicting the occurrence...
With the increasing scale and complexity of high performance computing (HPC) systems, reliability ma...
Continued reliance on human operators for managing data centers is a major impediment for them from ...
Large-scale networked computing systems are widely deployed to run business-critical applications...
Network failures are still one of the main causes of distributed systems’ lack of reliability. To ov...
The operation of industrial manufacturing processes can suffer greatly when critical compo...
Following the growth of high performance computing (HPC) systems in size and complexity, and the adv...
Distributed computing environments are increasingly deployed over geographically spanning data cente...
The pervasive digital innovation of the last decades has led to a remarkable transformation of maint...
With the revolution of the internet, new applications have emerged in our daily life. People are depe...
Data centers today host a number of computational resources to support the increasing demand for com...
We focus on machine failure prediction in Industry 4.0. Indeed, it is used for classification problem...