Abstract. With petascale computers only a year or two away, there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have far too little detailed information available on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and the largest applications may surrender large fractions of the computer’s resources to taking checkpoints and restarting from a checkpoint after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, pro...
With the increasing presence, scale, and complexity of distributed systems, resource failures are be...
The emergence of petascale systems and the promise of future exascale systems have reinvigorated the...
As supercomputers and clusters increase in size and complexity, system failure...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
A large percentage of computing capacity in today's large high-performance computing systems is waste...
Over the past few years resilience has become a major issue for HPC systems, in particular in the pe...
The current approach to resilience for large high-performance computing (HPC) machines is based on g...
Designing highly dependable systems requires a good understanding of failure characteristics. Unfort...
The era of petascale computing brought machines with hundreds of thousands of processors. T...
From genomic sequencing to weather forecasting, high-performance computing systems (HPCs) have prof...
In the past twenty years, there has been a wealth of theoretical research on minimizing the expected...