Checkpointing has been widely adopted to support fault tolerance and job migration, both essential for large-scale networked multicore systems and cloud computing. This dissertation pursues an effective checkpointing mechanism to handle failures and unavailability events in such systems and thus to reduce the expected job turnaround time, the aggregated file size, and the monetary cost involved. To withstand unavailability or failures of local nodes in networked systems, multi-level checkpointing is indispensable, with checkpoint files kept not only locally but also at remote storage. As the number of nodes in such a system grows, I/O bandwidth to remote storage quickly becomes the bottleneck for multi-level checkpointing. The first part of this wor...
Abstract — Checkpointing is a typical approach to tolerate failures in today’s supercomputing cluste...
Crash and omission failures are common in service providers: a disk can break down or a link can fai...
In cloud computing, users can rent computing resources from service providers according to their dem...
In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mech...
By leveraging the enormous amount of computational capabilities, scientists today are able to ...
A non-invasive, cloud-agnostic approach is demonstrated for extending existing ...
This paper was submitted to IEEE Cloud 2010. Recently introduced spot instances in the Amazon Elastic...
Abstract: The main objective of this research work is to improve the checkpoint efficiency for integrat...
Abstract: Checkpointing is a procedure of storing process state to a file, which is later used to re...
Efficient checkpointing of distributed data structures periodically at key mom...
Transparent hypervisor-level checkpoint-restart mechanisms for virtual clusters (VCs) or clusters of...
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing...