In recent years, large-scale computer clusters have become the dominant platform for applications that require extremely high computational capacity. A cluster consists of a large set of ordinary servers interconnected by a fast network. Because each node runs its own instance of the operating system and in that sense works autonomously, supervising the whole cluster is a challenge. To get an overview of the efficiency and utilization of the system, it is not enough to look at a single computer; all nodes must be monitored to get a good view of how the cluster behaves. Monitoring performance counters in a large-scale computation cluster raises many difficulties. How can samples of performance metrics be made...
The evolution of parallel and distributed architectures and programming paradigms for performance-or...
The use of a cluster for distributed performance analysis of parallel trace data is discussed. We p...
We present a monitoring system for large-scale parallel and distributed computing environments that ...
Clusters have become the main platform for parallel and distributed computing in high performance co...
In this paper we describe the architecture of PerfMC, a performance monitoring system for clusters o...
Parallel architectures, like the transputer-based multicomputer network, offer potentially enormous...
Robust high throughput computing requires effective monitoring and enforcement of a variety of reso...
Effective management and utilization of large computer clusters requires suitable management tools....
The CMS experiment's online cluster consists of 2300 computers and 170 switches or routers operating...
Systems for high performance computing are getting increasingly complex. On the one hand, the number...
Concurrency levels in large-scale, distributed-memory supercomputers are rising exponentially. Moder...
Big data is prevalent in HPC computing. Many HPC projects rely on complex workflows to analyze terab...
The constant monitoring of a computer is one of the essentials to be up-to-date about its state. Thi...
Robust high throughput computing requires effective monitoring and enforcement of a variet...
In this paper, we present a structure for monitoring a large set of computational clusters. We illus...