An increasingly large percentage of computing capacity in today's large high-performance computing systems is wasted due to failures and recoveries. Moreover, high-performance computing is expected to reach exascale within a decade, decreasing the mean time between failures to one day or even a few hours and making fault tolerance a major challenge for the HPC community. As a consequence, current research focuses on fault tolerance strategies that aim to minimize the effects of faults on applications. By far the most popular and widely used techniques in this field are rollback-recovery protocols. However, existing rollback-recovery techniques have severe scalability limitations, and without further optimizations the use of curren...
As the scale of High-performance Computing (HPC) systems continues to grow, researchers are devoted ...
Failure is an increasingly important issue in high performance computing and cloud systems. As large...
Thesis (Ph.D.) - Indiana University, Computer Sciences, 2010. Scientists use advanced computing techni...
Following the growth of high performance computing systems (HPC) in size and complexity, and the adv...
The need for computer systems to be reliable has increasingly become important as the dependence on ...
Supercomputers have played an essential role in the progress of science and engineering research. As...
2018 Summer. Includes bibliographical references. High performance computing (HPC) systems, such as da...
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in...
HPC systems are widely used in industrial, economical, and scientific applications, and many of thes...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
Transient hardware faults have become one of the major concerns affecting the reliability of modern ...
With the advent of resource-hungry applications such as scientific simulations and artificial intell...
In High Performance Computing (HPC) the demand for more performance is satisfied by increasing the n...
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at ...
An important set of challenges emerges as the High Performance Computing (HPC) community aims to rea...