Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend-resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low-overhead technique we call selective copy-on-write. It includes support for both ap...
As high performance computing centers (HPCC) continue to grow in popularity, issues of resource mana...
Crash and omission failures are common in service providers: a disk can break down or a link can fai...
As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) improves scientific prod...
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant inte...
In this work, we present the design of the Checkpointing-Enabled Virtual Machine (CEVM) ar...
Transparent hypervisor-level checkpoint-restart mechanisms for virtual clusters (VCs) or clusters of...
The efficient utilization of current supercomputing systems with deep storage hierarchies demands sc...
In a scientific community that increasingly relies upon High Performance Computing (HPC) for large s...
A non-invasive, cloud-agnostic approach is demonstrated for extending existing ...
Checkpointing can store and recover applications when faults happen and is becoming critical to large ...
Checkpointing has been widely adopted in support of fault-tolerance and job migration essential for ...
Virtualization has been widely adopted in recent years in the cloud computing platform to i...