Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However, the (full) attention mechanism incurs a computational cost that is quadratic in the sequence length, which is prohibitive for tasks with long sequences, e.g., inputs of 8k tokens. Although sparse attention, as suggested in existing work, can improve computational efficiency, it has limited modeling capacity and often fails to capture complicated dependencies in long sequences. To tackle this challenge, we propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans. Spe...
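The truncated abstract above stops before describing MASFormer's design in detail, so the following is only a minimal PyTorch sketch of the general "mixed attention spans" idea it names: a few layers keep full (quadratic) attention while the remaining layers restrict each token to a short local span. The class name, layer placement, and hyperparameters here are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of mixed attention spans across layers (assumed design, not the
# reference MASFormer implementation): most layers use a short local span, and a
# few top layers use full causal attention.
import torch


def attention_mask(seq_len, span, device=None):
    """Causal keep-mask; if `span` is given, also restrict each query to the last `span` keys."""
    i = torch.arange(seq_len, device=device)[:, None]
    j = torch.arange(seq_len, device=device)[None, :]
    allowed = j <= i                       # causal
    if span is not None:
        allowed &= (i - j) < span          # local (sliding-window) span
    return allowed


def masked_attention(q, k, v, mask):
    """Standard scaled dot-product attention under a boolean keep-mask."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


class MixedSpanStack(torch.nn.Module):
    """Toy decoder stack: local-span attention in lower layers, full attention in the top layers."""

    def __init__(self, n_layers=8, n_full=2, d_model=64, local_span=128):
        super().__init__()
        # Assumption for illustration: the full-attention layers sit at the top of the stack.
        self.spans = [None if i >= n_layers - n_full else local_span
                      for i in range(n_layers)]
        self.qkv = torch.nn.ModuleList(
            torch.nn.Linear(d_model, 3 * d_model) for _ in range(n_layers))
        self.out = torch.nn.ModuleList(
            torch.nn.Linear(d_model, d_model) for _ in range(n_layers))

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        for span, qkv, out in zip(self.spans, self.qkv, self.out):
            q, k, v = qkv(x).chunk(3, dim=-1)
            mask = attention_mask(seq_len, span, device=x.device)
            x = x + out(masked_attention(q, k, v, mask))  # residual; FFN/LayerNorm omitted for brevity
        return x


if __name__ == "__main__":
    model = MixedSpanStack()
    print(model(torch.randn(1, 512, 64)).shape)  # torch.Size([1, 512, 64])
```

With such a split, the cost of the local layers grows roughly linearly in the sequence length (about n times the span), and the quadratic cost is paid only in the few full-attention layers; how MASFormer itself allocates the spans is specified in the full paper rather than in this sketch.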
An ideal length-extrapolatable Transformer language model can handle sequences longer than the train...
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It c...
Transformer-based pretrained language models (LMs) are ubiquitous across natural language understand...
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexi...
Transformers have achieved success in both language and vision domains. However, it is prohibitively...
Recent work has shown that either (1) increasing the input length or (2) increasing model size can i...
Since their release, Transformers have revolutionized many fields from Natural Language Understandin...
Transformer models achieve state-of-the-art performance on a wide range of NLP tasks. They however s...
State space models (SSMs) have shown impressive results on tasks that require modeling long-range de...
We revisit the design choices in Transformers, and propose methods to address their weaknesses in ha...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
Efficient Transformers have been developed for long sequence modeling, due to their subquadratic mem...
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse ph...
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks....