Nowadays, more and more scientific fields rely on data mining to produce new results. Raw data are generated at an ever-increasing rate by instruments such as DNA sequencers in biology, the Large Hadron Collider (LHC) in physics, which produced 25 petabytes of data per year as of 2012, or the Large Synoptic Survey Telescope (LSST), which is expected to produce around 30 terabytes of data per night. High-resolution scanners in medical imaging and social networks also produce huge amounts of data. This data deluge raises several challenges in terms of storage and processing. To address the processing challenge, Google proposed the MapReduce model in 2004 to distribute the computation across several computers. This thesis focuses mainly on improving the performance of a MapReduce...
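To give an intuition of the programming model, here is a minimal, sequential Python sketch of the classic word-count example. It is for illustration only: the names map_phase, shuffle, and reduce_phase are ours, and in a real deployment each step would run on different machines, with the shuffle moving intermediate pairs over the network.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (key, value) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

Because the map and reduce functions operate on independent keys, a distributed runtime can partition the input among many machines and only synchronize during the shuffle, which is what makes the model scale to the data volumes described above.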