As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In additio...
Rollback-recovery in distributed systems is important for fault-tolerant computing. Without fault to...
This thesis examines memory management and rollback recovery in parallel architectures. Three memory...
Relaxed memory consistency models tolerate increased memory access latency in both hardware and soft...
Checkpoint is defined as a designated place in a program at which normal processing is interrupted s...
Parallel scientific applications deal with machine unreliability by periodic checkpointing, in which...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...
In this paper, we describe new protocols augmenting traditional cache coherency mechanisms to implem...
The execution times of large-scale parallel applications on modern multi/many-core systems a...
Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require un...
This article proposes an original approach that applies the Rollback-Dependency Trackability (RDT) p...
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attracti...
Consistent checkpointing provides transparent fault tolerance for long-running distributed applica...
The increasing number of cores on current supercomputers will quickly decrease the mean time to fail...
In this report, we consider the impact of the consistency model on checkpointing and rollback algori...