Abstract—Job scheduling on large-scale systems is an increasingly complicated affair, with numerous factors influencing scheduling policy. Addressing these concerns results in sophisticated scheduling policies that can be difficult to reason about. In this paper, we present a general utility-based scheduling framework to balance various scheduling requirements and priorities. It enables system owners to customize scheduling policies under different circumstances without changing the scheduling code. We also develop a fault-aware job allocation strategy for Blue Gene/P systems to address the increasing concern of system failures. We demonstrate the effectiveness of these facilities by means of event-driven simulations with real job traces...
High performance computing clusters provide an efficient and cost-effective solution to tackle large...
Abstract—As the scale of parallel systems continues to grow, fault management of these systems is be...
In scheduling a large number of user jobs for parallel execution on an open-resource Grid system, th...
Abstract—As systems scale toward exascale, many resources will become increasingly constrained. Whil...
Resource management and job scheduling is a crucial task on large-scale computing systems. Despite y...
Desktop Grids have proved to be a suitable platform for the execution of Bag-of-Tasks applications b...
System administrators for parallel computers face many difficulties when managing job scheduling sys...
Job checkpointing is one of the most commonly used techniques for providing fault tolerance in com...
Desktop Grids have proved to be a suitable platform for the execution of Bag-of-Tasks applications b...
Torus-connected networks are widely used in modern supercomputers due to their linear per node cost scal...
The major GRID infrastructures are designed mainly for batch-oriented computing with coarse-grained j...
Real-time systems are being extensively used in applications that are mission-critical and life-crit...
Scheduling jobs on HPC systems can impact performance, efficiency, and utilization greatly. It can b...
This article reviews the production scheduling problems focusing on those related to flexible job-sh...