The 300,000-CPU-core HTCondor batch farm at CERN provides the computing power for the initial processing of data coming from the LHC experiments. Such a large-scale computing setup inevitably comes with an abundance of monitoring data, and current monitoring methods cannot be configured in a reasonable amount of time to catch all potential anomalies. Building on previous and ongoing work in the IT department, this project focuses on the HTCondor batch system, using both the base monitoring metrics and the HTCondor job data to evaluate our options for better anomaly detection and handling.
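As an illustration only (not taken from the abstract above), the following is a minimal sketch of what unsupervised anomaly detection over batch monitoring metrics could look like, assuming scikit-learn and a hypothetical CSV export of per-node metrics; the file name and column names are invented for the example.

import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical export of HTCondor monitoring metrics: one row per node sample.
# Column names are illustrative, not the actual metric names.
df = pd.read_csv("batch_metrics.csv")
features = df[["cpu_load", "mem_used_gb", "running_jobs", "held_jobs"]]

# IsolationForest flags samples that are easy to isolate from the rest,
# i.e. statistical outliers, without needing labeled failure data.
model = IsolationForest(contamination=0.01, random_state=42)
df["anomaly"] = model.fit_predict(features)  # -1 = anomalous, 1 = normal

print(df[df["anomaly"] == -1].head())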
The LHCb Vertex Locator (VELO) is a silicon strip semiconductor detector operating at just 8mm dista...
For several years CERN has been offering a centralised service for Elasticsearch, a popular distribu...
In response to the demand for higher computational power, the number of computing nodes in high perf...
The LHCb experiment at CERN will have an Event Filter Farm (EFF) composed of 2000 CPUs. These machin...
The Large Hadron Collider is the world’s largest single machine and the most powerful particle accel...
For data centres it is increasingly important to monitor the network usage, and learn from network u...
The CMS experiment at the LHC relies on HTCondor and glideinWMS as its primary batch and pilot-based...
The IHEP local cluster is a middle-sized HEP data center which consists of 20'000 CPU slots, hundred...
Reliability, availability and maintainability are parameters that determine if a large-scale acceler...
Monitoring has proved to be a crucial part of the operation lifecycle of any computer system, as it ...
The LHCb experiment at the LHC accelerator at CERN collects collisions of particle bunches at 40 MHz...
The CERN automation infrastructure consists of over 600 heterogeneous industrial control systems wit...
LHCb, one of the 4 experiments at the LHC accelerator at CERN, uses approximately 1500 PCs (averagin...
The prompt reconstruction of the data recorded from the Large Hadron Collide...
The availability of computing resources is a limiting factor in data collectio...