In high performance computing systems, parallel applications request large numbers of resources for long periods. If a resource fails during a run, every application using that resource fails with it. The probability of application failure is therefore tied to the inherent reliability of the resources the application uses. Our investigation of high performance computing systems operating in the field has revealed a significant difference in the measured operational reliability of individual computing nodes. By adding awareness of the individual system nodes' reliability to the scheduler along with the predicted reliability needs of parallel applications, reliable resources can be matched with the mo...
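The matching idea in the abstract above can be illustrated with a small sketch. It is not the authors' scheduler: it assumes independent nodes with exponentially distributed failures, so a job that runs for t hours on nodes with failure rates λ_i survives with probability exp(-t · Σ λ_i), and it uses a simple greedy rule that hands the most reliable free nodes to the jobs with the most node-hours. The function names, node names, and failure rates are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def job_survival_probability(failure_rates: List[float], hours: float) -> float:
    """P(no node fails during the run), assuming independent exponential failures."""
    return math.exp(-hours * sum(failure_rates))


def assign_most_reliable_first(
    node_rates: Dict[str, float],
    jobs: List[Tuple[str, int, float]],
) -> Dict[str, List[str]]:
    """Greedy sketch: jobs with the most node-hours get the most reliable nodes.

    node_rates maps node name -> failures per hour (hypothetical measurements).
    jobs is a list of (job name, nodes needed, expected hours).
    """
    free = sorted(node_rates, key=node_rates.get)  # lowest failure rate first
    placement: Dict[str, List[str]] = {}
    # Largest node-hour demand first, since those jobs are most exposed to failures.
    for name, width, hours in sorted(jobs, key=lambda j: j[1] * j[2], reverse=True):
        if width > len(free):
            continue  # not enough free nodes; a real scheduler would queue the job
        placement[name], free = free[:width], free[width:]
    return placement
```

Under these assumptions, a wide, long-running job placed on the lowest-failure-rate nodes has a higher survival probability than the same job placed on arbitrary nodes, which is the effect a reliability-aware scheduler tries to exploit.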
Applications implemented on critical systems are subject to both safety critic...
Increasing the size and complexity of modern HPC systems also increases the probability of various ty...
As supercomputers and clusters increase in size and complexity, system failure...
The demand for more computational power to solve complex scientific problems has been driving the ph...
The growing demand for more computational power to solve complex scientific problems is driving the ...
High performance computing clusters provide an efficient and cost effective solution to tackle large...
In large-scale grid environments, accurate failure prediction is critical to achieve effective resou...
Resource failures and down times have become a growing concern for large-scale computational platfor...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
Lately, distributed computing (DC) has emerged in several application scenarios such as grid comput...
Reliability is an increasingly pressing issue for High-Performance Computing systems, as failures ar...
2018 Summer. High performance computing (HPC) systems, such as da...
Maintaining performance in a faulty distributed computing environment is a major challenge in the de...
Present and future computational applications require massively parallel processors- Top500.org re...