The growth of High Performance Computing (HPC) systems increases the complexity of understanding resource utilization, system management, and performance issues. While raw performance data is increasingly exposed at the component level, the usefulness of the data depends on the ability to perform meaningful analysis on actionable timescales. However, current system monitoring infrastructures focus largely on data collection, with analysis performed off-system in post-processing mode. This increases the time required to provide analysis and feedback to a variety of consumers. In this work, we enhance the architecture of a monitoring system used on large-scale computational platforms to integrate streaming analysis capabilities a...
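To make the contrast with post-processing concrete, here is a minimal Python sketch, under stated assumptions, of one simple kind of streaming analysis: turning cumulative counter readings into per-node rates and a smoothed (exponentially weighted) rate as each sample arrives, instead of after the data has been stored. The `Sample` record, the `StreamingRate` class, the `analyze` helper, and the `alpha` default are hypothetical illustrations, not the interface of the monitoring system described above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass
class Sample:
    # Hypothetical record: one cumulative counter reading from one node.
    node: str
    timestamp: float  # seconds
    value: float      # monotonically increasing counter, e.g. bytes read


@dataclass
class StreamingRate:
    """Derive per-node rates from cumulative counters as samples arrive.

    alpha is a hypothetical EWMA smoothing factor (0 < alpha <= 1).
    """
    alpha: float = 0.5
    _last: Dict[str, Tuple[float, float]] = field(default_factory=dict, init=False, repr=False)
    _ewma: Dict[str, float] = field(default_factory=dict, init=False, repr=False)

    def update(self, s: Sample) -> Optional[Tuple[float, float]]:
        prev = self._last.get(s.node)
        if prev is not None and s.timestamp <= prev[0]:
            return None  # stale or duplicate sample: ignore it, keep the baseline
        self._last[s.node] = (s.timestamp, s.value)
        if prev is None:
            return None  # first reading for this node: no rate yet
        dv = s.value - prev[1]
        if dv < 0:
            return None  # counter reset: this reading becomes the new baseline
        rate = dv / (s.timestamp - prev[0])
        # Exponentially weighted moving average, seeded with the first rate.
        old = self._ewma.get(s.node, rate)
        smoothed = self.alpha * rate + (1 - self.alpha) * old
        self._ewma[s.node] = smoothed
        return rate, smoothed


def analyze(stream: Iterable[Sample], analyzer: StreamingRate) -> Iterator[Tuple[str, float, float]]:
    # Emit (node, rate, smoothed_rate) as soon as each result is available.
    for s in stream:
        out = analyzer.update(s)
        if out is not None:
            yield s.node, out[0], out[1]


samples = [Sample("n1", 0.0, 0.0), Sample("n1", 1.0, 100.0), Sample("n1", 2.0, 300.0)]
results = list(analyze(samples, StreamingRate(alpha=0.5)))
print(results)
```

Because each result depends only on the previous reading and the running average for that node, the state per node is constant in size, which is what makes this style of analysis feasible on or near the system while data is still being collected.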
Performance analysis tools allow application developers to identify and characterize the inefficienc...
Large science projects rely on complex workflows to analyze terabytes or petabytes of data. These jo...
We present a monitoring system for large-scale parallel and distributed computing environments that ...
In this work, system monitoring and analysis are discussed in terms of their significance and bene...
Linux currently plays an important role in high-end computing systems, but recent work has shown th...
Performance monitoring of HPC applications offers opportunities for adaptive optimization based on d...
Networks are the backbone of modern HPC systems. They serve as a critical piece of infrastructure, t...
Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can p...
The purpose of this project was to build an extensible cross-platform infrastructure to facilitate t...
Over the last few years, large-scale computer clusters have become dominant for performing computations in ...
Biologists doing high-throughput high-content cellular analysis are generally not computer scientist...
There is a variety of tools to measure the performance of Linux systems and the applications running...
The HPC system consists of a set of layers of software and hardware for I/O and networking. System l...
Nowadays, power and energy consumption are of paramount importance. Further, r...
The current trend in high performance computing is to aggregate ever larger numbers of pro...