Increasing the size and complexity of modern HPC systems also increases the probability of various types of failures. Failures may disrupt application execution and waste valuable system resources due to failed executions. In this work, we explore the effect of node failures on the completion times of MPI parallel jobs. We introduce a simulation environment that generates synthetic traces of node failures, assuming that the times between failures for each node are independently distributed, following the same distribution but with different parameters. To highlight the importance of failure-awareness for resource allocation, we compare two failure-oblivious resource allocation approaches with one that considers node failure probabilities before assigni...
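The trace-generation idea described above can be illustrated with a minimal sketch. It is not the authors' simulator: the abstract does not name the failure distribution, so a Weibull distribution is assumed here purely for illustration, and the function name `generate_failure_trace`, the node identifiers, and the parameter values are all hypothetical.

```python
import random


def generate_failure_trace(node_params, horizon, seed=0):
    """Generate a synthetic failure trace for a set of nodes.

    node_params maps node_id -> (scale, shape). Every node uses the same
    distribution family (Weibull, an assumption of this sketch) but its
    own parameters. Times between failures are drawn independently.
    Returns {node_id: [failure times]} with all times <= horizon.
    """
    rng = random.Random(seed)  # seeded so that traces are reproducible
    trace = {}
    for node_id, (scale, shape) in sorted(node_params.items()):
        t = 0.0
        failures = []
        while True:
            # Draw the next time between failures for this node.
            t += rng.weibullvariate(scale, shape)
            if t > horizon:
                break
            failures.append(t)
        trace[node_id] = failures
    return trace


if __name__ == "__main__":
    # Hypothetical parameters: node "n0" fails more often than node "n1".
    params = {"n0": (100.0, 0.7), "n1": (500.0, 0.7)}
    for node, times in generate_failure_trace(params, horizon=1000.0).items():
        print(node, len(times), "failures")
```

A failure-aware allocator could consume such a trace, or the per-node parameters behind it, to rank nodes before placing a job, whereas a failure-oblivious allocator would ignore them.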
The growing complexity and size of High Performance Computing systems (HPCs) lead to frequen...
With the increasing presence, scale, and complexity of distributed systems, resource failures are be...
The demand for more computational power to solve complex scientific problems has been driving the ph...
Resource failures and down times have become a growing concern for large-scale computational platfor...
As the scale of parallel systems continues to grow, fault management of these systems is be...
This work provides an analysis of checkpointing strategies for minimizing expe...
Application outages due to node failures are common problems in high performance computing. Reliabil...
After a machine failure, batch schedulers typically reschedule the job that failed with a high prior...
Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissat...
In high performance computing systems, parallel applications request a large number of resources for...
Due to the growing size of compute clusters, large scale parallel applications increasingly have to ...
The growth in size and complexity of HPC systems leads to a rapid increase of their failure rates. I...