Abstract: In this work, we present the design of the Checkpointing-Enabled Virtual Machine (CEVM) architecture. A unique feature of the CEVM is an efficient mechanism for checkpointing virtual machines in hypervisor-based High Performance Computing (HPC) systems. Our goals are (1) to enable implicit system-level fault tolerance without modifying existing operating systems, applications, or hardware, and (2) to minimize the space and time overhead needed to execute software that cannot tolerate faults. We accomplish these goals by leveraging hypervisor technologies that have yielded tremendous reliability and productivity improvements in the HPC community. In this paper, we discuss the two novel protocols used in the VM checkpoint...
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant inte...
The paper discusses the constructive framework for writing a hypervisor on top of the VM. ...
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increas...
In a scientific community that increasingly relies upon High Performance Computing (HPC) for large s...
Large-scale computing platforms provide tremendous capabilities for scientific discovery. ...
Hypervisor-based fault tolerance (HBFT), which synchronizes the state between the primary VM and the...
Crash and omission failures are common in service providers: a disk can break down or a link can fai...
Transparent hypervisor-level checkpoint-restart mechanisms for virtual clusters (VCs) or clusters of...
Critical applications require dependability mechanisms to prevent them from failures due to faults. D...
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant inte...
Checkpointing, i.e., recording the volatile state of a virtual machine (VM) running as a guest in a ...
This study explores a recovery strategy using checkpointing in a distributed shared virtual memory (...
Protocols to implement a fault-tolerant computing system are described. These protocols augment the ...
Virtual machine checkpoints provide a clean encapsulation of the full state of an executing system....
Operating systems and hypervisors enable the collection and extraction of rich information on applic...