Data skew, cluster heterogeneity, and network traffic are three issues that significantly affect the performance of MapReduce applications, yet the hash partitioner in native Hadoop considers none of them. This paper proposes a new partitioner for YARN (Hadoop 2.6.0), named PIY, which adopts a parallel sampling method to estimate the distribution of the intermediate data. Based on this estimate, PIY first mitigates data skew in MapReduce applications; second, it accounts for the heterogeneity of computing resources to balance the load among reducers; third, it reduces network traffic in the shuffle phase by trying to retain intermediate data on nodes that act as both mapper and reducer. Compared with the native Hado...
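The abstract does not include PIY's implementation details. As a rough illustration of the general idea only, the sketch below shows a custom Hadoop Partitioner that assigns keys to reducers by binary search over split points produced by a sampling pass, which is one standard way a sampled key distribution can be turned into balanced partitions. The configuration key `sampled.split.points` and the class name are hypothetical; PIY's actual parallel sampling, heterogeneity weighting, and locality logic are not reproduced here.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Illustrative sketch only: a range partitioner driven by sampled split
 * points, in the spirit of sampling-based skew mitigation. This is not
 * PIY's algorithm.
 */
public class SampledRangePartitioner extends Partitioner<Text, Text>
        implements Configurable {

    // Hypothetical config key holding comma-separated split points that a
    // sampling pass over the map outputs would have produced.
    public static final String SPLIT_POINTS_KEY = "sampled.split.points";

    private Configuration conf;
    private String[] splitPoints = new String[0];

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        String raw = conf.get(SPLIT_POINTS_KEY, "");
        if (!raw.isEmpty()) {
            splitPoints = raw.split(",");
            Arrays.sort(splitPoints);
        }
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        if (splitPoints.length == 0) {
            // Fall back to plain hash partitioning when no sample exists.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
        // Binary search over the sampled boundaries maps each key to the
        // range it falls into; the sampler chooses boundaries so that the
        // ranges carry roughly equal amounts of intermediate data.
        int pos = Arrays.binarySearch(splitPoints, key.toString());
        int partition = pos >= 0 ? pos + 1 : -(pos + 1);
        return Math.min(partition, numPartitions - 1);
    }
}
```

In a driver, one would set `sampled.split.points` from the sampling pass and register the class with `job.setPartitionerClass(SampledRangePartitioner.class)`.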
Big data parallel frameworks, such as MapReduce or Spark, have been praised for...
Large quantities of data have been generated from multiple sources at exponential rates in the last ...
In an attempt to increase the performance/cost ratio, large compute clusters are becoming heterogene...
Hadoop is a standard implementation of the MapReduce framework for running data-intensive applications o...
Data locality and data skew on the reduce side are two essential issues in MapReduce. Improving data...
Although MapReduce has been praised for its high scalability and fault toleran...
MapReduce is an effective framework for processing large datasets in parallel over a cluster. Data l...
Nowadays, we are witnessing the fast production of very large amounts of data, ...
Over the last ten years, MapReduce has emerged as one of the staples of distributed computing, both in...
MapReduce is an effective tool for parallel data processing. One significant issue in practical MapR...
Algorithms for mitigating imbalance of the MapReduce computations are considered in this paper. Map...
As the data growth rate outpaces the processing capabilities of CPUs, reaching Petascale, tec...
MapReduce is emerging as a prominent tool for big data processing. Data locali...
MapReduce is a parallel computing model in which a large dataset is split into smaller parts and exe...
The MapReduce framework has become the de facto scheme for scalable semi-structured and unstructured...