Analyzing, understanding and predicting failure are of paramount importance for achieving effective fault management. While various fault prediction methods have been studied in the past, many of them are not practical for use in real systems. In particular, they fail to address two crucial issues: one is to provide location information (i.e., the components where the failure is expected to occur) and the other is to provide sufficient lead time (i.e., the time interval preceding the time of failure occurrence). In this paper, we first refine the widely used metrics for evaluating prediction accuracy by including location as well as lead time. We then present a practical failure prediction mechanism for IBM Blue Gene systems. A Genetic A...
Automated fault prediction and diagnosis in HPC systems needs to be efficient for better system resi...
Online failure prediction is an approach that aims to increase system reliability by predicting pend...
Equipment failures of large and complex safety-critical plants are unavoidable. The forthcoming fau...
To facilitate proactive fault management in large-scale systems such as IBM Blue Gene/P, on...
The growing computational and storage needs of scientific applications mandate the deployment of ext...
The growing computational and storage needs of scientific applications mandate the deployment of ex...
The demand for more computational power in science and engineering has spurred the design and deploy...
The demands of increasingly large scientific application workflows lead to the need for more powerfu...
In this paper, we present the Framework for building Failure Prediction Models (F2PM), a Machin...
Failure is an increasingly important issue in high performance computing and cloud systems. As la...
Failure prediction is one of the key challenges that have to be mastered for a new arena of fault to...
We focus on machine failure prediction in Industry 4.0. Indeed, it is used for classification problem...
The sudden downtime and unplanned maintenance not only drastically increase the maintenance cost but...
With the growth of system size and complexity, reliability has become a major concern for large-scal...
System- and application-level failures can be characterized by mining relevant log files ...