Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However, the (full) attention mechanism incurs a computational cost that is quadratic in the sequence length, which is prohibitive for tasks with long sequences, e.g., inputs of 8k tokens. Although sparse attention, as suggested in existing work, can improve computational efficiency, it has limited modeling capacity and often fails to capture complicated dependencies in long sequences. To tackle this challenge, we propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans. Spe...
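The truncated abstract above stops before describing MASFormer's design in detail, so the following is only a minimal PyTorch sketch of the general "mixed attention spans" idea it names: a few layers keep full (quadratic) attention while the remaining layers restrict each token to a short local span. The class name, layer placement, and hyperparameters here are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of mixed attention spans across layers (assumed design, not the
# reference MASFormer implementation): most layers use a short local span, and a
# few top layers use full causal attention.
import torch


def attention_mask(seq_len, span, device=None):
    """Causal keep-mask; if `span` is given, also restrict each query to the last `span` keys."""
    i = torch.arange(seq_len, device=device)[:, None]
    j = torch.arange(seq_len, device=device)[None, :]
    allowed = j <= i                       # causal
    if span is not None:
        allowed &= (i - j) < span          # local (sliding-window) span
    return allowed


def masked_attention(q, k, v, mask):
    """Standard scaled dot-product attention under a boolean keep-mask."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


class MixedSpanStack(torch.nn.Module):
    """Toy decoder stack: local-span attention in lower layers, full attention in the top layers."""

    def __init__(self, n_layers=8, n_full=2, d_model=64, local_span=128):
        super().__init__()
        # Assumption for illustration: the full-attention layers sit at the top of the stack.
        self.spans = [None if i >= n_layers - n_full else local_span
                      for i in range(n_layers)]
        self.qkv = torch.nn.ModuleList(
            torch.nn.Linear(d_model, 3 * d_model) for _ in range(n_layers))
        self.out = torch.nn.ModuleList(
            torch.nn.Linear(d_model, d_model) for _ in range(n_layers))

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        for span, qkv, out in zip(self.spans, self.qkv, self.out):
            q, k, v = qkv(x).chunk(3, dim=-1)
            mask = attention_mask(seq_len, span, device=x.device)
            x = x + out(masked_attention(q, k, v, mask))  # residual; FFN/LayerNorm omitted for brevity
        return x


if __name__ == "__main__":
    model = MixedSpanStack()
    print(model(torch.randn(1, 512, 64)).shape)  # torch.Size([1, 512, 64])
```

With such a split, the cost of the local layers grows roughly linearly in the sequence length (about n times the span), and the quadratic cost is paid only in the few full-attention layers; how MASFormer itself allocates the spans is specified in the full paper rather than in this sketch.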
An ideal length-extrapolatable Transformer language model can handle sequences longer than the train...
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It c...
Transformer-based pretrained language models (LMs) are ubiquitous across natural language understand...
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexi...
Transformers have achieved success in both language and vision domains. However, it is prohibitively...
Recent work has shown that either (1) increasing the input length or (2) increasing model size can i...
Since their release, Transformers have revolutionized many fields from Natural Language Understandin...
Transformer models achieve state-of-the-art performance on a wide range of NLP tasks. They however s...
State space models (SSMs) have shown impressive results on tasks that require modeling long-range de...
We revisit the design choices in Transformers, and propose methods to address their weaknesses in ha...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
Efficient Transformers have been developed for long sequence modeling, due to their subquadratic mem...
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse ph...
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks....