Failures are common in today’s data center environment and can significantly impact the performance of important jobs running on top of large-scale computing frameworks. In this paper, we analyze Hadoop’s behavior under compute node and process failures. Surprisingly, we find that even a single failure can have a large detrimental effect on job running times. We uncover several important design decisions underlying this distressing behavior: the inefficiency of Hadoop’s statistical speculative execution algorithm, the lack of sharing of failure information, and the overloading of TCP failure semantics. We hope that our study will add new dimensions to the pursuit of robust large-scale computing framework designs.
The analysis and modeling of the failures bound to occur in today's large-scale production systems i...
SALSA examines system logs to derive state-machine views of the system’s execution, along with contro...
Hadoop has become a critical component in today’s cloud environment. Ensuring good performance for ...
Large-scale data analysis has increasingly come to rely on MapReduce and its o...
Large, production quality distributed systems still fail periodically, and do so sometimes catastro...
The increasing use of computing resources in our daily lives leads to data being generate...
Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissat...
Due to the growing size of compute clusters, large scale parallel applications increasingly have to ...
Big data processing frameworks (MapReduce, Hadoop, Dryad) are hugely popular today because they grea...
Most cloud computing clusters are built from unreliable, commercial off-the-shelf components compar...
Distributed computing environments are increasingly deployed over geographically spanning data cente...
The growing complexity and size of High Performance Computing systems (HPCs) lead to frequen...
Modern day datacenters host hundreds of thousands of servers that coordinate tasks in order to deliv...
Hadoop YARN is a software framework that supports data-intensive distributed applications. ...