Although currently available general purpose microprocessors feature more than 10 cores, many programs remain mostly sequential. This can either be due to an inherent property of the algorithm used by the program, to the program being old and written during the uni-processor era, or simply to time to market constraints, as writing and validating parallel code is known to be hard. Moreover, even for parallel programs, the performance of the sequential part quickly becomes the limiting improvement factor as more cores are made available to the application, as expressed by Amdahl's Law. Consequently, increasing sequential performance remains a valid approach in the multi-core era. Unfortunately, conventional means to do so - increasing the out...