We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence, and has linear complexity with respect to sequence length. Our recurrent cell operates on blocks of tokens rather than single tokens during training, and leverages parallel computation within a block in order to make efficient use of accelerator hardware. The cell itself is strikingly simple. It is merely a transformer layer: it uses self-attention and cross-attention to efficiently compute a recurrent function over a large set of state vectors and tokens. Our design was inspired in part by LSTM cells, and it uses LSTM-style gates, but it scales the typical LSTM cell up by several orders of magnitude. Our implementation o...
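A minimal, single-head PyTorch sketch of such a block-recurrent cell is given below, purely to make the description above concrete. It assumes several simplifications relative to the paper: one attention head, no positional encodings or causal masking, and a simplified convex-combination gate standing in for the exact LSTM-style gating. Names such as BlockRecurrentCell and attend are illustrative and do not come from the authors' implementation.

# Minimal single-head sketch of a block-recurrent cell (illustrative only).
# Simplifications: no positional encodings, no causal masking, one head,
# and a simplified gate; names here are not taken from the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(q, k, v):
    # Scaled dot-product attention over full matrices (no masking).
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

class BlockRecurrentCell(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.q_tok = nn.Linear(d_model, d_model)
        self.q_state = nn.Linear(d_model, d_model)
        self.kv_tok = nn.Linear(d_model, 2 * d_model)
        self.kv_state = nn.Linear(d_model, 2 * d_model)
        self.proj_tok = nn.Linear(2 * d_model, d_model)
        self.proj_state = nn.Linear(2 * d_model, d_model)
        # LSTM-style gates controlling how much of the proposed update
        # is written into the recurrent state.
        self.z_gate = nn.Linear(2 * d_model, d_model)  # candidate path
        self.f_gate = nn.Linear(2 * d_model, d_model)  # forget path

    def forward(self, tokens, state):
        # tokens: (block_len, d_model), state: (num_state, d_model)
        k_t, v_t = self.kv_tok(tokens).chunk(2, dim=-1)
        k_s, v_s = self.kv_state(state).chunk(2, dim=-1)

        # Token update: self-attention over the block plus cross-attention
        # to the current state, combined by a linear projection.
        q_t = self.q_tok(tokens)
        tok_out = tokens + self.proj_tok(
            torch.cat([attend(q_t, k_t, v_t), attend(q_t, k_s, v_s)], dim=-1))

        # State update: self-attention over states plus cross-attention to
        # the token block, written into the state through the gates.
        q_s = self.q_state(state)
        h = torch.cat([attend(q_s, k_s, v_s), attend(q_s, k_t, v_t)], dim=-1)
        z = torch.tanh(self.z_gate(h))
        f = torch.sigmoid(self.f_gate(h))
        new_state = f * state + (1 - f) * z

        return tok_out, new_state

# Usage: the cell is applied once per block, carrying the state across blocks.
cell = BlockRecurrentCell(d_model=64)
state = torch.zeros(16, 64)               # 16 recurrent state vectors
for block in torch.randn(4, 128, 64):     # 4 blocks of 128 tokens each
    out, state = cell(block, state)

Because the cell consumes a whole block of tokens per step and carries only a fixed-size set of state vectors between blocks, the number of sequential steps grows with the number of blocks rather than with the number of tokens, which is what gives the linear complexity in sequence length described above.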
Training large transformer models is one of the most important computational challenges of modern AI...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
Recent work has shown that either (1) increasing the input length or (2) increasing model size can i...
Transformer-based models show their effectiveness across multiple domains and tasks. The self-attent...
State space models (SSMs) have shown impressive results on tasks that require modeling long-range de...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language ...
We revisit the design choices in Transformers, and propose methods to address their weaknesses in ha...
Deep learning has achieved great success in many sequence learning tasks such as machine translation...
Transformers in their common form are inherently limited to operate on whole token sequences rather ...
Originally developed for natural language problems, transformer models have recently been widely use...
This document aims to be a self-contained, mathematically precise overview of transformer architectu...
The transformer architecture and its variants have presented remarkable success across many machine learning ...
There has been an explosion of interest in designing high-performance Transformers. While Transforme...
Existing large language models have to run K times to generate a sequence of K tokens. In this paper...
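As a point of reference, here is a minimal sketch of the standard autoregressive decoding loop that the sentence above describes: generating K tokens takes K sequential forward passes, because each new token must be fed back in before the next one can be predicted. Both greedy_decode and the ToyLM stand-in below are hypothetical illustrations, not part of any particular library or of the paper's proposed method.

# Standard greedy autoregressive decoding: one forward pass per new token,
# so K tokens cost K sequential model calls. ToyLM is a toy stand-in model.
import torch

def greedy_decode(model, prompt_ids, k):
    ids = list(prompt_ids)
    for _ in range(k):                       # one forward pass per new token
        logits = model(torch.tensor([ids]))  # (1, len(ids), vocab_size)
        next_id = int(logits[0, -1].argmax())
        ids.append(next_id)                  # next step depends on this token
    return ids[len(prompt_ids):]

class ToyLM(torch.nn.Module):
    # Toy language model returning per-position logits, just to run the loop.
    def __init__(self, vocab_size=100, d=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, d)
        self.head = torch.nn.Linear(d, vocab_size)
    def forward(self, ids):
        return self.head(self.emb(ids))

print(greedy_decode(ToyLM(), prompt_ids=[1, 2, 3], k=5))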