A large percentage of computing capacity in today's large high-performance computing systems is wasted due to failures and recoveries. The fear in our community is that future exascale systems will fail so frequently that no useful work will be possible. My research focuses on characterizing the events generated at the hardware, system, or application level by understanding the complex correlations between different system components. This information is used to predict failures and, as a consequence, to minimize or prevent their effects on running applications. The image represents an overview of the overall analysis process: monitoring applications and their performance, modeling the system and the way anomalies propagate between componen...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effe...
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, eff...
The current approach to resilience for large high-performance computing (HPC) machines is based on g...
With petascale computers only a year or two away there is a pressing need to anticipate and compensa...
Following the growth of high-performance computing (HPC) systems in size and complexity, and the adv...
Large-scale computing systems provide great potential for scientific exploration. However, the comp...
Abstract. With petascale computers only a year or two away there is a pressing need to anticipate an...
© 2014 IEEE. As the sizes of supercomputers and data centers grow towards exascale, failures become ...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
The demands of increasingly large scientific application workflows lead to the need for more powerfu...
Performance failures are commonplace in most computing environments; without system monitoring they ...
Proceeding of: 2019 IEEE International Conference on Advanced Scientific Computing (ICASC). Exascale s...
An increasingly large percentage of computing capacity in today's large high-performance computing s...
Large-scale computing systems provide great potential for scientific exploration. However, the compl...
Distributed computing environments are increasingly deployed over geographically spanning data cente...