Tightly coupling hardware accelerators with processors is a well-known approach for boosting the efficiency of MPSoC platforms. The key design challenges in this area are: (i) streamlining accelerator definition and instantiation and (ii) developing architectural templates and run-time techniques for minimizing the cost of communication and synchronization between processors and accelerators. In this paper we present an architecture featuring tightly-coupled processors and hardware processing units (HWPU), with zero-copy communication. We also provide a simple programming API, which simplifies the process of offloading jobs to HWPUs