Statistical Fault Detection for Parallel Applications with AutomaDeD

Bronevetsky, G
Laguna, I
Bagchi, S
de Supinski, B R
Ahn, D
Schulz, M

Publication date

March 2010

Publisher

Lawrence Livermore National Laboratory

Abstract

Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. The large component count means that these systems fail frequently and often in very complex ways, making them difficult to use and maintain. While prior work on fault detection and diagnosis has focused on faults that significantly reduce system functionality, the wide variety of failure modes in modern systems makes them likely to fail in complex ways that impair system performance but are difficult to detect and diagnose. This paper presents AutomaDeD, a statistical tool that models the timing behavior of each application task and tracks its behavior to identify any abnormalities. If any are observed, AutomaDeD can immediately det...
