With petascale computers only a year or two away, there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers, and integrators have far too little detailed information on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and that the largest applications may surrender large fractions of the computer’s resources to taking checkpoints and restarting from them after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, projects fail...
With the increasing presence, scale, and complexity of distributed systems, resource failures are be...
Modern high-end computers are unprecedentedly complex. Occurrence of faults is an inevitable fact in...
As supercomputers and clusters increase in size and complexity, system failure...
The emergence of petascale systems and the promise of future exascale systems have reinvigorated the...
A large percentage of computing capacity in today's large high-performance computing systems is waste...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
The current approach to resilience for large high-performance computing (HPC) machines is based on g...
Over the past few years, resilience has become a major issue for HPC systems, in particular in the pe...
From genomic sequencing to weather forecasting, high-performance computing systems (HPCs) have prof...
Designing highly dependable systems requires a good understanding of failure characteristics. Unfort...
As machines increase in scale, it is predicted that failure rates of supercomputers will correspondi...
The era of petascale computing brought machines with hundreds of thousands of processors. T...