Memory system design is important for providing high reliability and availability. This dissertation presents a memory architecture to support checkpoints that can improve reliability, and also algorithms to improve recoverable virtual memory. In addition, two novel techniques of reliability analysis are presented that account for program and operating system behavior. Checkpoint and rollback recovery is a method that allows a system to tolerate a failure by periodically saving the state and, if an error occurs, rolling back to the prior checkpoint. A technique is proposed that embeds the support for checkpoint and rollback recovery directly into the virtual memory translation hardware. A system with both highly reliable and normal memory e...
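The checkpoint and rollback cycle described above can be illustrated with a minimal software sketch. This is not the hardware mechanism proposed in the dissertation, which embeds the support in the virtual memory translation hardware; it only shows the save-then-restore control flow. The names `CheckpointedState`, `run_with_recovery`, and the `interval` parameter are hypothetical, and a fault is assumed to surface as a `RuntimeError`.

```python
import copy


class CheckpointedState:
    """Holds working state plus one saved copy (the last checkpoint)."""

    def __init__(self, initial):
        self.state = copy.deepcopy(initial)
        self._saved = copy.deepcopy(initial)

    def checkpoint(self):
        # Save a full copy of the current state.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the working state and restore the last checkpoint.
        self.state = copy.deepcopy(self._saved)


def run_with_recovery(steps, initial, interval=2):
    """Run each step in order, checkpointing every `interval` completed steps.

    A step is a callable that mutates the state dict in place. If a step
    raises RuntimeError (assumed here to model a detected fault), the state
    is rolled back and execution resumes from the step that followed the
    last checkpoint.
    """
    mem = CheckpointedState(initial)
    rollbacks = 0
    last_ckpt = 0  # index of the first step after the last checkpoint
    i = 0
    while i < len(steps):
        try:
            steps[i](mem.state)
        except RuntimeError:
            mem.rollback()
            rollbacks += 1
            i = last_ckpt
            continue
        i += 1
        if i % interval == 0:
            mem.checkpoint()
            last_ckpt = i
    return mem.state, rollbacks
```

The sketch copies the whole state at every checkpoint, which is the simplest possible policy. It also assumes faults are transient: a step that fails on every attempt would make the loop retry forever.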
Operating systems and hypervisors enable the collection and extraction of rich information on applic...
This dissertation develops a new approach for evaluating the dependability of fault-tolerant compute...
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory mu...
This study explores a recovery strategy using checkpointing in a distributed shared virtual memory (...
This thesis examines memory management and rollback recovery in parallel architectures. Three memory...
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in...
As technology feature size continues to shrink, we see two challenging problems in the design of com...
As High Performance platforms (Clusters, Grids, etc.) continue to grow in size...
High performance computing applications must be tolerant to faults, which are common occurrences esp...
One of the fundamental limits to high-performance, high-reliability applications is memory's vulnera...
In this paper, we study real-time in-memory checkpointing as an effective means to improve the relia...
Aggressive process scaling and increasing demands of performance/cost efficiency have exacerbated th...
For implementing fault-tolerance in multicomputer systems, backward error recovery, based on checkpo...
The research investigated under this grant: (1) a virtual checkpointing scheme for recovery, (2) sc...
Checkpointing and rollback recovery is a very effective technique to tolerate the occurrence of fail...