Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this pa...
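The checkpoint-and-restart cycle described above can be illustrated with a minimal, hand-instrumented sketch. This is application-level checkpointing written by the programmer, not the compiler-generated instrumentation the abstract refers to. The names `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` parameter are hypothetical and exist only to simulate a fault in a toy summation loop.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to disk atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum 0..n_steps-1, saving state every `interval` steps.

    `fail_at` (hypothetical, for illustration) raises an error at that step
    to stand in for a hardware fault.
    """
    # Restart from the last saved state if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` once with `fail_at` set and then again without it resumes from the most recent checkpoint rather than from step 0; work done after that checkpoint and before the fault is recomputed.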
Multiple threads running in a single, shared address space is a simple model for writing parallel pr...
As the size of high performance clusters multiplies, the probability of system failure grows substa...
InteGrade is a grid middleware infrastructure that enables the use of idle computing power from user...
Trends in high-performance computing are making it necessary for long-running applications to toler...
We propose a method to incorporate coordinated checkpointing and rollback in high performance comp...
Checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
To make progress in the face of failures, long-running parallel applications need to save their ...
Because of increasing hardware and software complexity, the running time of many computational scie...
As modern supercomputing systems reach the peta-flop performance range, they grow in both ...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and c...
Because of increasing hardware and software complexity, the running time of many computational scien...
Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this,...
Nowadays, clusters are widely used to execute scientific applications. These applications are often ...