Coupling processors with acceleration hardware is an effective way to improve the energy efficiency of embedded systems. Many-core is now a dominant design paradigm for SoCs, which creates new challenges and opportunities for designing HW blocks. It is therefore extremely important to explore acceleration solutions that fit naturally into well-established parallel programming models and that can be added incrementally on top of existing parallel applications. In this paper we focus on tightly-coupled multi-core cluster architectures, representative of the basic building block of the most recent many-cores, and we enhance them with dedicated HW processing units (HWPUs). We propose an architecture where the HWPUs share the same L1 data memory through...