Fault analysis in communication networks and distributed systems is a difficult process that depends heavily on the system administrator’s experience and on supporting tools. It usually requires analytic techniques and several types of event data, including log events, debug messages, and traces obtained from these systems, to investigate the root cause of faults. This paper introduces an approach that exploits context-aware data and a classification technique to improve this process. The approach uses both event data and context-aware data, including CPU load, memory, processes, temperature, and status, to train a decision tree, and then applies the tree to assess suspected events. We have implemented the approach and experimented with it on the OpenStack...
The method for ensuring availability in an existing cloud environment is primarily a metric-based fa...
Failure analysis and resolution in cloud-computing environments are a highly important issue, ...
In this paper, we present CLUE, a system event analytics tool for black-box performance dia...
Part 4: Applications of Parallel and Distributed Computing. In modern computer s...
Effective root cause analysis (RCA) of performance issues in modern cloud environments remains a h...
Cloud computing systems fail in complex and unexpected ways, due to unexpected combinations of event...
Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events...
Modern IT infrastructures are constructed by large scale computing systems and administered by IT se...
Software bugs in cloud management systems often cause erratic behavior, hindering detection, and rec...
© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become ...
In order to plan for failure recovery, the designers of cloud systems need to understand how their s...
Enterprise and high-performance computing systems are growing extremely large and complex, employing...
Cloud computing is a novel technology in the field of distributed computing. Usage of Cloud computin...
Recent advances in contextual anomaly detection attempt to combine resource metrics and event logs t...