State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences thanks to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have delivered strong performance on a wide range of tasks in vision and audio; however, they still lag behind Transformers on language modeling tasks. In this work, we propose a hybrid layer, the Block-State Transformer (BST), which internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences. We study three different, and completely parallelizable, variants that integrate SSMs and block-wise attention. We show ...
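To make the hybrid layer concrete, the following is a minimal NumPy sketch of the general idea: an SSM sublayer produces long-range context states, and a block-wise attention sublayer attends within short fixed-size blocks while also attending to that context. The function names (`ssm_scan`, `block_attention`, `block_state_layer`), the single-head attention, the choice to expose the SSM context as one extra key/value token per block, and the block length are all illustrative assumptions; this is not the paper's exact formulation of any of its three variants.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Simple linear state-space recurrence over a sequence.
    x: (T, d_in), A: (d_state, d_state), B: (d_state, d_in), C: (d_out, d_state)."""
    T = x.shape[0]
    state = np.zeros(A.shape[0])
    outputs = np.zeros((T, C.shape[0]))
    for t in range(T):                      # sequential form; parallel scans/convolutions exist
        state = A @ state + B @ x[t]
        outputs[t] = C @ state
    return outputs

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def block_attention(x, context, block_len):
    """Causal attention restricted to fixed-size blocks; each block additionally
    attends to a context vector taken from the SSM output (an assumption here)."""
    T, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, T, block_len):
        blk = x[start:start + block_len]               # (b, d) tokens of this block
        ctx = context[start:start + block_len][-1:]    # last SSM state of the block
        kv = np.concatenate([ctx, blk], axis=0)        # prepend context as extra key/value
        scores = blk @ kv.T / np.sqrt(d)
        b = blk.shape[0]
        mask = np.tril(np.ones((b, b), dtype=bool))    # causal mask within the block
        full_mask = np.concatenate([np.ones((b, 1), dtype=bool), mask], axis=1)
        scores = np.where(full_mask, scores, -1e9)     # context token is always visible
        out[start:start + block_len] = softmax(scores) @ kv
    return out

def block_state_layer(x, params, block_len=4):
    context = ssm_scan(x, params["A"], params["B"], params["C"])
    return x + block_attention(x, context, block_len)  # residual connection

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d, d_state, T = 8, 16, 12
params = {
    "A": 0.9 * np.eye(d_state),
    "B": 0.1 * rng.normal(size=(d_state, d)),
    "C": 0.1 * rng.normal(size=(d, d_state)),
}
x = rng.normal(size=(T, d))
y = block_state_layer(x, params, block_len=4)
print(y.shape)  # (12, 8)
```

The key property the sketch illustrates is that attention cost grows with the block length rather than the full sequence length, while the SSM carries information across blocks; both sublayers can, in principle, be computed in parallel over the sequence.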
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks....
LSTMs and other RNN variants have shown strong performance on character-level language modeling. The...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
Transformers have achieved success in both language and vision domains. However, it is prohibitively...
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashi...
We revisit the design choices in Transformers, and propose methods to address their weaknesses in ha...
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexi...
Recent work has shown that either (1) increasing the input length or (2) increasing model size can i...
State space models have been shown to be effective at modeling long-range dependencies, especially on sequ...
Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of a...
Transformer-based models show their effectiveness across multiple domains and tasks. The self-attent...
Transformer encoder-decoder models have shown impressive performance in dialogue modeling. However, ...
Transformer models have achieved promising results on natural language processing (NLP) tasks includ...
Transformer models achieve state-of-the-art performance on a wide range of NLP tasks. They however s...