Efficient Transformers have been developed for long sequence modeling, owing to their subquadratic memory and time complexity. The Sparse Transformer is a popular approach to improving the efficiency of Transformers: it restricts self-attention to locations specified by predefined sparse patterns. However, leveraging sparsity can sacrifice expressiveness relative to full attention when important token correlations lie multiple hops away. To combine the efficiency of sparse attention with the expressiveness of full-attention Transformers, we propose Diffuser, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory cost...
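As a rough illustration of the pattern-restricted attention described in the abstract above, the following minimal numpy sketch limits self-attention to a fixed local window. The function name sliding_window_attention and the window parameter are illustrative assumptions, not code from Diffuser or the Sparse Transformer, and for clarity the sketch builds the dense score matrix and masks it rather than exploiting the banded structure.

import numpy as np

def sliding_window_attention(Q, K, V, window=4):
    # Q, K, V: (n, d) arrays for a single attention head.
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Predefined sparse pattern: each query may only attend to keys
    # within `window` positions of itself.
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)
    # Softmax over the allowed positions only.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

A practical implementation would gather only the keys inside each window so that memory scales with n * window rather than n^2; this is exactly the tradeoff the abstract refers to, since tokens more than `window` positions apart never interact within a single layer.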
Transformer networks are able to capture patterns in data coming from many domains (text, images, vi...
Emerging from the monolithic pairwise attention mechanism in conventional Transformer models, there ...
The attention mechanism is the key to many state-of-the-art transformer-based models in Natural Lang...
Transformers have achieved success in both language and vision domains. However, it is prohibitively...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
To overcome the quadratic cost of self-attention, recent works have proposed various sparse attentio...
Transformer models achieve state-of-the-art performance on a wide range of NLP tasks. They however s...
Sparsifying the Transformer has garnered considerable interest, as training the Transformer is very ...
Transformer models have achieved state-of-the-art results across a diverse range of domains. However...
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks....
In this paper, we propose that the dot product pairwise matching attention layer, which is widely us...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing...
Transformers have recently shown superior performances on various vision tasks. The large, sometimes...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
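For contrast with the sparse-pattern approaches above, here is a minimal sketch of the kernel-approximation idea behind Performers: the softmax kernel exp(q . k) is rewritten as an expectation over positive random features, so the n x n attention matrix never needs to be formed and the cost becomes linear in sequence length. The helper names (random_feature_map, linear_attention) and the feature count are illustrative assumptions, not the official FAVOR+ implementation.

import numpy as np

def random_feature_map(x, W):
    # x: (n, d) queries or keys; W: (m, d) Gaussian projections.
    # phi(x) = exp(W x - |x|^2 / 2) / sqrt(m); since phi is non-negative,
    # the approximated attention weights stay positive.
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ W.T - sq_norm) / np.sqrt(W.shape[0])

def linear_attention(Q, K, V, num_features=256, seed=0):
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, d))
    # Scale inputs so phi(q) . phi(k) approximates exp(q . k / sqrt(d)).
    Qp = random_feature_map(Q * d ** -0.25, W)   # (n, m)
    Kp = random_feature_map(K * d ** -0.25, W)   # (n, m)
    # Reorder the matmuls: O(n * m * d) instead of O(n^2 * d).
    numer = Qp @ (Kp.T @ V)                      # (n, d)
    denom = Qp @ Kp.sum(axis=0)                  # (n,)
    return numer / denom[:, None]

Because the feature map is positive, the unnormalized weights behave like a proper kernel estimate of softmax attention, and accuracy improves as num_features grows.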