With the enormous number of computing resources in HPC and Cloud systems, failures have become a major concern. Failure behaviors such as reliability, failure rate, and mean time to failure therefore need to be understood in order to manage such large systems efficiently. This dissertation makes three major contributions to HPC and Cloud studies. First, it studies a reliability model with correlated failures in a k-node system for HPC applications. The model is extended to improve accuracy by accounting for failure correlation. The Marshall-Olkin Multivariate Weibull distribution is improved with an excess-life (conditional Weibull) formulation to better estimate system reliability. Also, the univariate method is proposed for estimating Marshall-Olkin Multivariate Weibull...
Correctly measuring the reliability and availability of a cloud-based system is critical for evaluat...
The use of cloud computing is extending to all kinds of systems, including the ones that are part of...
Failure is an increasingly important issue in high performance computing and cloud systems. As large...
The demand for more computational power to solve complex scientific problems has been driving the ph...
Performance of cloud computing depends on effective utilization of resources and reliability. With r...
Resource failures and down times have become a growing concern for large-scale computational platfor...
This dissertation introduces a new metric in the area of High Performance Computing (HPC) applicatio...
Resource failures and down times have become a growing concern for large-scale computational platfor...
Modern-day datacenters host hundreds of thousands of servers that coordinate tasks in order to deliv...
Failure in a cloud system is defined as an event that occurs when the delivered service deviates f...
Most cloud computing clusters are built from unreliable, commercial off-the-shelf components compar...
Cloud computing is increasingly attracting huge attention both in academic research and industry ini...
Dependence of computing resources on each other in cloud computing systems (CCS) makes them prone to...
As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failur...
Cloud computing has emerged as a platform that grants users direct yet shared access to remote ...