Fault detection and prediction in HPC clusters and cloud-computing systems are increasingly challenging. Several pieces of system middleware, such as job schedulers and MPI implementations, support both reactive and proactive mechanisms for tolerating faults. These techniques rely on external components, such as system logs and infrastructure monitors, to report hardware and software failures, either by detecting them or by predicting them. However, this middleware works in isolation and does not disseminate knowledge of the faults it encounters. In this context, we propose a lightweight multi-threaded service, FTB-IPMI, which provides distributed fault monitoring using the Intelligent Platform Management Interface (IPMI) a...
on HPC Systems. (Under the direction of Associate Professor Dr. Frank Mueller). Reliability is incre...
Resilience/fault-tolerance has become a key challenge for large-scale parallel...
The era of petascale computing brought machines with hundreds of thousands of processors. T...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Enginee...
Automated fault prediction and diagnosis in HPC systems needs to be efficient for better system resi...
Application outages due to node failures are common problems in high performance computing. Reliabil...
High performance computing systems can have high failure rates as they feature a large number of ser...
Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with th...
Resiliency of exascale systems has quickly become an important concern for the scientific community....
We propose lightweight middleware solutions that facilitate and simplify the execution of failure-re...
Scientists use advanced computing techniques to assist in answering the complex questions at the for...
Following the growth of high performance computing systems (HPC) in size and complexity, and the adv...
As the scale of parallel systems continues to grow, fault management of these systems is be...
Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At ...