Modern embedded MPSoC designs increasingly couple hardware accelerators to processing cores to trade off energy efficiency against platform specialization. Designing such systems effectively requires, on the one hand, clear methodologies that streamline accelerator definition and instantiation and, on the other, architectural templates and runtime techniques that minimize processor-to-accelerator communication costs. In this paper we present an architecture featuring tightly coupled processors and accelerators with zero-copy communication. Efficient programming is supported by an extended OpenMP programming model, in which custom directives allow code regions to be specialized for execution on parallel cores, accelerators,...