Abstract — Scalability models are powerful analytical tools for evaluating and predicting the performance of parallel applications. Unfortunately, existing scalability models do not quantify failure impact and therefore cannot accurately account for application performance in the presence of failures. In this study, we extend two well-known models, namely Amdahl’s law and Gustafson’s law, by considering the impact of failures and the effect of fault tolerance techniques on applications. The derived reliability-aware models can be used to predict application scalability in failure-present environments and evaluate fault tolerance techniques. Trace-based simulations via real failure logs demonstrate that the newly developed models provide a ...
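For orientation, the two baseline laws named in the abstract can be sketched in their classic, failure-free textbook form. This is only an illustration: the function names and the `parallel_fraction` parameter are assumptions introduced here, and the reliability-aware extensions derived in the paper are not reproduced because the abstract does not state them.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Classic Amdahl's law (fixed problem size), with no failure terms."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # Serial part is unchanged; parallel part is divided across processors.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def gustafson_speedup(parallel_fraction: float, processors: int) -> float:
    """Classic Gustafson's law (problem size scales with processor count)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # Scaled speedup: the parallel workload grows linearly with processors.
    return serial_fraction + parallel_fraction * processors


if __name__ == "__main__":
    # Hypothetical workload: 90% of the work is parallelizable.
    for n in (1, 10, 100, 1000):
        print(n, amdahl_speedup(0.9, n), gustafson_speedup(0.9, n))
```

Both functions assume failure-free execution, which is the simplification the abstract identifies as the limitation of existing scalability models.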
In high performance computing systems, parallel applications request a large number of resources for...
Maintaining performance in a faulty distributed computing environment is a major challenge in the de...
As supercomputers become larger and more powerful, they are growing increasingly complex. This is re...
Abstract—Speedup models are powerful analytical tools for evaluating and predicting the performance ...
As supercomputers and clusters increase in size and complexity, system failure...
As the number of processors in today’s parallel systems continues to grow, the mean-time-to-failure ...
Fault tolerance has become an important issue for parallel applications in the last few years. The p...
Modern high-end computers are unprecedentedly complex. Occurrence of faults is an inevitable fact in...
Researchers have mentioned that the three most difficult and growing problems in the future of high-...
This paper presents the segregated failures model (SFM) of availability of fault-tolerant computer s...
Continuous technology scaling in semiconductor industry forces reliability as a serious design conce...
This paper presents a method of estimating the availability of fault-tolerant computer systems with ...
Abstract—Supercomputers have seen an exponential increase in their size in the last two decades. Suc...
Processor failures in post-petascale parallel computing platforms are common o...