It has been claimed that what simplifies parallelism can also simplify resilience. Based on that assertion, we present the Concurrent Collections (CnC) programming model as an ideal target for a simple yet powerful resilience system for parallel computations. Specifically, we claim that the same attributes that simplify reasoning about parallel applications written in CnC will also simplify the implementation of a checkpoint/restart system within the CnC runtime. We define these properties of CnC in the context of a model built in K. To demonstrate how these properties simplify resilience, we have implemented a simple checkpoint/restart system within Rice’s Habanero C implementation of the CnC runtime. We sho...
Unregulated concurrency in functional programs may lead to space demands that exceed available space...
Checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
Checkpointing in a distributed system is essential for recovery to a globally consistent state after...
We introduce the Concurrent Collections (CnC) programming model. In this model, programs are written...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
We introduce the Concurrent Collections (CnC) programming model. CnC supports flexible combinations ...
Resilient objects are instances of distributed abstract data types that are tolerant to failures. D...
The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputer...
The ever-increasing number of computation units assembled in current HPC platforms leads to a concer...
InteGrade is a grid middleware infrastructure that enables the use of idle computing power from user...
To make progress in the face of failures, long-running parallel applications need to save their ...
Traditional checkpoint and recovery are based upon two basic assumptions. The first is the need to h...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
With increasing scale and complexity of supercomputing and cloud computing arc...
Failures are increasingly threatening the efficiency of HPC systems, and curre...