In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput and latency and reduces GPU memory usage without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with ...
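As a rough illustration of the recurrent paradigm described above, the sketch below implements the per-token retention update $S_n = \gamma S_{n-1} + K_n^\top V_n$ with output $o_n = Q_n S_n$ in plain NumPy. This is a minimal, single-head, unnormalized sketch assuming precomputed Q, K, V projections; the function name `retention_recurrent` and its interface are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def retention_recurrent(Q, K, V, gamma):
    """Minimal sketch of the recurrent form of retention (single head).

    Q, K, V: arrays of shape (seq_len, d); gamma: scalar decay in (0, 1).
    The state S has shape (d, d), so each decoding step costs O(d^2)
    regardless of sequence length, which is what yields O(1) per-token
    inference in this paradigm.
    """
    seq_len, d = Q.shape
    S = np.zeros((d, d))
    outputs = np.zeros((seq_len, d))
    for n in range(seq_len):
        # S_n = gamma * S_{n-1} + K_n^T V_n
        S = gamma * S + np.outer(K[n], V[n])
        # o_n = Q_n S_n
        outputs[n] = Q[n] @ S
    return outputs

# Example usage with random projections (illustrative only):
# Q = K = V = np.random.randn(16, 64); out = retention_recurrent(Q, K, V, 0.97)
```

The same outputs can be computed in the parallel form during training; the recurrence above is the equivalent formulation used at decoding time.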
Recurrent neural networks have shown remarkable success in modeling sequences. However, low-resource ...
Connectionist models of sentence processing must learn to behave systematically by generalizing from...
Recently, a new recurrent neural network (RNN) named the Legendre Memory Unit (LMU) was proposed an...
Deep learning has achieved great success in many sequence learning tasks such as machine translation...
Transformer-based models show their effectiveness across multiple domains and tasks. The self-attent...
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashi...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
In this thesis, we study novel neural network structures to better model long-term dependency in seq...
Increasing the capacity of recurrent neural networks (RNN) usually involves augmenting the size...
Existing large language models have to run K times to generate a sequence of K tokens. In this paper...
With the development of feed-forward models, the default model for sequence modeling has gradually e...
Recurrent neural networks (RNNs) have represented for years the state of the art in neural machine t...
Inducing sparseness while training neural networks has been shown to yield models with a lower memor...
We present several modifications of the original recurrent neural network language model (R...
The presence of Long Distance Dependencies (LDDs) in sequential data poses significant challenges fo...