ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are neglected and not brought back into production because the problems are not visible to the WMS. In this paper we describe a method to analyse logs from the ADC resource provisioning system (AutoPyFactory) and provide monitoring views which target poorly performing resources and help diagnose the issues in good time. Central to this analysis is the use of Amazon Web Services (AWS) to provide an inexpensive and stable analytics platform. In particula...
Many web information systems and applications are now run as cloud-hosted systems. The consumers oft...
Metric collection and analysis is an important aspect of operational management of many systems. Ade...
Failure of application operations is one of the main causes of system-wide outages in cloud environm...
Cloud computing systems provide the facilities to make application services resilient against failur...
Monitoring of the large-scale data processing of the ATLAS experiment includes monitoring of product...
The ATLAS Data analytics effort is focused on creating systems which provide the ATLAS ADC with new ...
This thesis work deals with the development of two Machine Learning (ML) based systems for the autom...
The ATLAS Experiment benefits from computing resources distributed worldwide at more than 100 WLCG s...
This contribution summarizes evolution of the ATLAS Distributed Computing (ADC) Monitoring project d...
To meet a sharply increasing demand for computing resources for LHC Run 2, ATLAS distributed computi...
In this paper, we present CLUE, a system event analytics tool for black-box performance dia...
In 2015 ATLAS Distributed Computing started to migrate its monitoring systems away from Oracle DB an...
We address several problems in intelligent log management of distributed cloud computing application...
The Worldwide LHC Computing Grid (WLCG) includes more than 170 grid and cloud computing ce...