This paper reports on the design and implementation of the HPC performance monitoring system deployed to continuously monitor performance metrics of all jobs on the HPC systems at the Max Planck Computing and Data Facility (MPCDF). Thereby it reveals important information to various stakeholders, in particular to users, application support, system administrators, and management. On each compute node, hardware and software performance monitoring data is collected by our newly developed lightweight open-source hpcmd middleware which builds upon standard Linux tools. The data is transported via rsyslog, and aggregated and processed by a Splunk system, enabling detailed per-cluster and per-job interactive analysis in a web browser. Additionally...
HPC application developers encounter significant challenges getting their codes to run correctly on ...
Modern parallel systems and applications are constantly increasing in scale and complexity, and cons...
Large science projects rely on complex workflows to analyze terabytes or petabytes of data. These jo...
The HPC service at CERN provides Linux batch infrastructure to run high performance computing appli...
This paper introduces an infrastructure for efficiently collecting performance profiles from paralle...
Big data is prevalent in HPC computing. Many HPC projects rely on complex workflows to analyze terab...
In this work, system monitoring and analysis are discussed in terms of their significance and bene...
Report on the Magistère d'informatique internship, prepared in parallel with the first year of the Master d'i...
One key to improving high performance computing (HPC) productivity is to find better ways to measure...
Nowadays, power and energy consumption are of paramount importance. Further, r...
A considerable fraction of scientific discovery nowadays relies on computer simulations. High Per...
In this paper we describe the architecture of PerfMC, a performance monitoring system for clusters o...
System and job monitoring are two established ways of measuring the utilization of HPC systems. Due t...
Given the complexity of modern HPC systems, achieving theoretical peak performance depends on a myri...
A new tool and web portal are presented for deployment of High Performance Com...