As the number of CPU cores in high-performance computing platforms continues to grow, the availability and reliability of these systems become a primary concern. Some solutions are physical (e.g., power backup) and some are software-driven. Lawrence Berkeley National Laboratory has created a system-level fault-tolerant checkpoint/restart implementation for Linux clusters. It allows processes to restart computations from the last known checkpoint if the system crashes. Checkpoint data creation is highly dependent on system input and output operations. This paper proposes: (i) a technique to improve the efficiency of these I/O operations and (ii) an alternative checkpoint creation method to increase availability and relia...
The use of Commercial Off-The-Shelf (COTS) processors is increasingly attractive for the space domai...
Full system reliability is a problem that spans multiple levels of the software/hardware stack. The...
Parallel scientific applications deal with machine unreliability by periodic checkpointing, in which...
By leveraging enormous computational capabilities, scientists today are able to ...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
Large machines with tens or even hundreds of thousands of processors are currently in use. As the nu...
Computational power demand for large challenging problems has increasingly driven the physical size ...
Thesis (Ph.D.) - Indiana University, Computer Sciences, 2010. Scientists use advanced computing techni...
Finding the failure rate of a system is a crucial step in high performance computing systems analysi...
With increasing scale and complexity of supercomputing and cloud computing arc...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de...
We present an in-depth analysis of the crash-recovery problem and propose a novel approach to recove...
This paper describes a checkpoint comparison and optimistic execution technique for error detection ...
Heterogeneous systems have increased their popularity in recent years due to the high per...