International audience. Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hierarchical rollback-recovery protocols based on the combination of coordinated checkpointing and message logging are an alternative. These partial message logging protocols are based on process clustering: only messages between clusters are logged to limit the consequence of a failure to one cluster. These protocols would work ef...
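The clustering idea described above can be illustrated with a minimal sketch. It is not the protocol from the paper: the names `cluster_of`, `should_log`, `processes_to_roll_back`, and `logged_fraction` are hypothetical, and the sketch assumes a static assignment of process ranks to clusters and a list of `(sender, receiver, size_in_bytes)` messages.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical message record: (sender rank, receiver rank, payload size in bytes).
Message = Tuple[int, int, int]


def should_log(sender: int, receiver: int, cluster_of: Dict[int, int]) -> bool:
    """Return True only for messages that cross a cluster boundary."""
    return cluster_of[sender] != cluster_of[receiver]


def processes_to_roll_back(failed: int, cluster_of: Dict[int, int]) -> Set[int]:
    """Return the ranks that restart after a failure: the failed rank's cluster."""
    target = cluster_of[failed]
    return {rank for rank, cluster in cluster_of.items() if cluster == target}


def logged_fraction(messages: Iterable[Message], cluster_of: Dict[int, int]) -> float:
    """Fraction of the total communicated bytes that would have to be logged."""
    total = 0
    logged = 0
    for sender, receiver, size in messages:
        total += size
        if should_log(sender, receiver, cluster_of):
            logged += size
    return logged / total if total else 0.0


# Illustrative data (assumed, not taken from the paper): four ranks in two clusters.
example_clusters = {0: 0, 1: 0, 2: 1, 3: 1}
example_messages = [(0, 1, 100), (1, 2, 50), (2, 3, 100), (3, 0, 50)]
```

With this assumed layout, a failure of rank 2 would roll back only ranks 2 and 3, and only the two inter-cluster messages would be logged. A clustering that keeps heavy traffic inside clusters lowers `logged_fraction`, at the cost of larger clusters rolling back together.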
International audience. As reported by many recent studies, the mean time between failures of future...
International audience. Fault tolerance in MPI is becoming a major issue in the HPC community. Several appr...