Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted with frequent failures, and resilience techniques must be employed to ensure the completion of large applications. Indeed, failures may create severe imbalance between applications and significantly degrade performance. In this paper, we propose to redistribute the resources assigned to each application when failures strike, in order to minimize the expected completion time of a set of co-scheduled applications. First, we introduce a formal model and establish complexity results. When no redistribution ...
As High Performance Computing (HPC) systems increase in size to fulfill computational power demand, ...
This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) p...
This thesis focuses on resilience for high performance applications that execute on large scale plat...
This thesis explores co-scheduling problems in the context of large-scale applications with two main...
This paper investigates co-scheduling algorithms for processing a set of paral...
This thesis consists of two parts: performance bounds for scheduling algorithms for parallel program...
Maintaining performance in a faulty distributed computing environment is a major challenge in the de...
High performance computing applications must be resilient to faults. The tradi...
Processor failures in post-petascale parallel computing platforms are common o...
Large scale systems provide a powerful computing platform for solving large and complex scientific a...
Emerging architecture designs include tens of processing cores on a single chip die; it is believed ...
We study the scheduling of computational workflows on compute resources that experience exponentially...
Proc. of the 37th IEEE International Conference on Parallel Processing (ICPP 2008) IEEE Computer Soci...