MapReduce is a programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both the service providers in the cloud and users: the service provider can use this knowledge to make better scheduling decisions, while the user can learn what aspects of their jobs impact performance. This paper analyzes 10 months of MapReduce logs from the M45 supercomputing cluster, which Yahoo! made freely available to select universities for systems research. We characterized resource utilization patterns, job patterns, and sources of failures. We use an instance-based learning techniqu...
Increasingly, large systems and data centers are being built in a 'scale out' manner, i.e. using lar...
A problem commonly faced in Computer Science research is the lack of real usage data that can be use...
Big Data such as Terabyte and Petabyte datasets are rapidly becoming the new norm for various organi...
MapReduce is a parallel programming model used by Cloud service providers for data mining. To be abl...
In the HPC community the System Utilization metric enables to determine if the res...
Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissat...
This paper presents a comprehensive statistical analysis of a variety of workloads collected on prod...
There is an increasing number of MapReduce applications, e.g., personalized advertising, spam detect...
In recent years there has been an extraordinary growth of large-scale data processing and related te...
In recent years, large-scale data analysis has become critical to the success of modern enterpri...
Big Data analytics is increasingly performed using the MapReduce paradigm and its open-source implem...
MapReduce is a parallel programming paradigm used for processing huge datasets on certain c...
Most cloud computing clusters are built from unreliable, commercial off-the-shelf components compar...