The growing computational and storage needs of scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, a 64K dual-core processor system. One of the challenges of designing and deploying such systems in a production setting is the need to take failure occurrences into account. Once a large-scale system is equipped with failure prediction capability, its fault tolerance and resource management strategies can be improved significantly, and its performance can increase substantially. This dissertation is based on the Reliability, Availability and Serviceability (RAS) events generated by IBM BlueGene/L over a period of 142 days. Using these logs, we performed failure analysis, modeling, an...
Failure prediction has long been known to be a challenging problem. With the evolving trend of technology...
Abstract—With the growth of system size and complexity, reliability has become of paramount importan...
Failure is an increasingly important issue in high performance computing and cloud systems. As large...
The growing computational and storage needs of scientific applications mandate the deployment of ex...
The demand for more computational power in science and engineering has spurred the design and deploy...
Analyzing, understanding and predicting failure is of paramount importance to achieve effective faul...
Abstract — System- and application-level failures can be characterized by mining relevant log files ...
Abstract—To facilitate proactive fault management in large-scale systems such as IBM Blue Gene/P, on...
From genomic sequencing to weather forecasting, high-performance computing systems (HPCs) have prof...
Following the growth of high performance computing systems (HPC) in size and complexity, and the adv...
A large percentage of computing capacity in today's large high-performance computing systems is waste...
We focus on machine failure prediction in Industry 4.0. Indeed, it is used for classification problem...
The demands of increasingly large scientific application workflows lead to the need for more powerfu...