Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, the efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values. By computing on-the-fly in streaming mode, our softmax implementation minimizes data movement.
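The abstract does not spell out the softmax algorithm itself, so the following is only a minimal sketch of the general idea behind an integer-only softmax, assuming a base-2, shift-based approximation of the exponential (exp(x) ~ 2**(x * log2(e))); it is not ITA's actual implementation. All names (integer_softmax, LOG2E_Q10, out_bits) are hypothetical, and the streaming/on-the-fly aspect is omitted: the sketch computes over a full vector for clarity.

import numpy as np

def integer_softmax(logits_q: np.ndarray, out_bits: int = 8) -> np.ndarray:
    """Integer-only softmax sketch over quantized logits.

    Approximates exp(x) by 2**(x * log2(e)): the constant log2(e) is
    applied as a fixed-point multiply, and the power of two becomes a
    right shift, so no floating-point unit is required. The result is
    unsigned fixed point that sums to roughly 2**out_bits.
    """
    x = logits_q.astype(np.int64)
    x = x - x.max()                    # subtract the max; exponents now <= 0
    LOG2E_Q10 = 1477                   # round(log2(e) * 2**10), Q10 constant
    shift = (-(x * LOG2E_Q10)) >> 10   # non-negative right-shift amounts
    ONE_Q16 = 1 << 16
    exp_q = np.right_shift(ONE_Q16, np.minimum(shift, 16))  # ~2**(-shift), Q16
    denom = exp_q.sum()                # >= ONE_Q16, since the max term is 1.0
    return ((exp_q << out_bits) // denom).astype(np.int32)

For example, integer_softmax(np.array([12, -3, 7], dtype=np.int8)) yields roughly [254, 0, 1], a fixed-point distribution summing to about 2**8. Replacing the exponential with shifts and integer multiplies is what makes an int8 datapath sufficient; a hardware version could additionally keep a running maximum and renormalize partial sums so the softmax is evaluated as scores stream in.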