Abstract: The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt (MTTI), because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the relationship between MTTI and the number of processor elements involved is quite similar across a large number of platforms. In this paper we present a system that uses hardware-level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These ...
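As a rough illustration of the two ideas in the abstract, the sketch below makes two assumptions that are not taken from the paper. First, it approximates "where an element lies in the statistical distribution of similar elements" with a plain z-score on a single monitored reading. Second, it models the drop in MTTI with the textbook approximation for independent, exponentially distributed failures, where system MTTI is the per-node MTTI divided by the node count. The function names, the `max_z` threshold, and the sample readings are all hypothetical.

```python
from statistics import mean, stdev


def select_nodes(readings, max_z=1.5):
    """Return the ids of nodes whose reading lies within max_z sample
    standard deviations of the population mean (hypothetical criterion)."""
    values = list(readings.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs >= 2 readings
    if sigma == 0:
        # All readings identical: no node stands out, keep everything.
        return sorted(readings)
    return sorted(
        node for node, value in readings.items()
        if abs(value - mu) / sigma <= max_z
    )


def system_mtti(node_mtti_hours, n_nodes):
    """Approximate system MTTI, assuming independent exponential failures."""
    return node_mtti_hours / n_nodes


# Hypothetical per-node temperature readings; "n5" is the outlier.
readings = {"n0": 50.0, "n1": 51.0, "n2": 49.0,
            "n3": 50.5, "n4": 49.5, "n5": 90.0}
selected = select_nodes(readings, max_z=1.5)
```

A z-score is only one possible placement measure. With small node groups, a median-based score would be less sensitive to the outliers it is meant to detect.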
Performance prediction of checkpointing systems in the presence of failures is a well-studied resear...
The use of increasingly complex hardware and software platforms in response to the ever rising perfo...
Large-scale computing systems provide great potential for scientific exploration. However, the comp...
Application requirements in High-Performance Computing (HPC) are becoming increasingly exacting, and...
In this work, system monitoring and analysis are discussed in terms of their significance and bene...
Application requirements in High-Performance Computing (HPC) are becoming increasingly exacting, and...
Resource failures and down times have become a growing concern for large-scale computational platfor...
Resource failures and down times have become a growing concern for large-scale computational platfor...
As high performance computing (HPC) systems grow larger, with increasing numbers of components, fail...
As high performance computing (HPC) systems grow larger, with increasing numbers of components, fail...
The performance monitoring unit (PMU) in multiprocessor system-on-chips (MPSoCs) is at the heart of ...
The growth of High Performance Computer (HPC) systems increases the complexity with respect to under...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
Taufer, Michela. High performance computing (HPC) is undergoing many changes at both the system and wo...
The pressing market demand for competitive performance/cost ratios compels Critical Real-Time Embedd...