Large high-performance computing systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. According to measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. We set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experime...
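As an illustration of the CPU-time figure quoted in the abstract above, the share of CPU time consumed by unsuccessful jobs can be computed from per-job accounting records. This is a minimal sketch, not the authors' method: the `Job` record, its field names, the status labels, and the sample values are all hypothetical, and CPU time is assumed to be cores multiplied by wall-clock hours.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical accounting record; field names are assumptions.
    status: str           # e.g. "COMPLETED", "FAILED", "TIMEOUT"
    cores: int            # CPU cores allocated to the job
    runtime_hours: float  # wall-clock runtime in hours


def unsuccessful_cpu_share(jobs, ok_status="COMPLETED"):
    """Return the fraction of total CPU time used by jobs whose status is not ok_status."""
    total = sum(j.cores * j.runtime_hours for j in jobs)
    if total == 0:
        return 0.0  # no CPU time recorded, so nothing was wasted
    wasted = sum(j.cores * j.runtime_hours for j in jobs if j.status != ok_status)
    return wasted / total


# Made-up sample records, for illustration only.
jobs = [
    Job("COMPLETED", 4, 10.0),
    Job("FAILED", 2, 5.0),
    Job("TIMEOUT", 1, 10.0),
    Job("COMPLETED", 8, 5.0),
]
share = unsuccessful_cpu_share(jobs)
print(f"{share:.1%} of CPU time went to unsuccessful jobs")
```

A real accounting log would use the scheduler's own status vocabulary, so the set of statuses counted as successful would need to be adapted.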
Abstract—This work presents models characterizing failures observed during the execution of large sc...
Employee turnover is a serious challenge for organizations and companies. Thus, the prediction of em...
Cloud failure is one of the critical issues since it can cost millions of dollars to cloud service p...
The paper is devoted to machine learning methods and algorithms for the supercomputer jobs executio...
Modern applications, such as smart cities, home automation, and eHealth, demand a new approach to im...
Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissat...
A science gateway is a web-based interface that provides access to High Performance Computing (HPC) ...
Cloud Services are the on-demand availability of resources like storage, data, and compute power. No...
In this research, we investigated two approaches to detect job anomalies and/or contention for large...
Doctor of Philosophy. Department of Computer Science. Daniel A. Andresen. Overestimation of High Performan...
Most cloud computing clusters are built from unreliable, commercial off-the-shelf components compar...
Failure is an increasingly important issue in high performance computing and cloud systems. As la...
GPUs are highly contended resources in shared clusters for deep learning (DL) training. However, our...
Abstract— At academic institutions, student placement is critical. It is the decisive element in adm...