A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to...