We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9$\times$ on Wiki-40B and 12.1$\times$ on PG-19 for auto-regressive language modeling, and 4.8$\times$ on C4 for masked language modeling.
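As a rough illustration of the gated attention unit described in the abstract above, the sketch below pairs a single weak attention head (queries and keys derived as cheap per-dimension transforms of one shared low-dimensional projection, with relu-squared weights in place of softmax) with elementwise gating of the attended values. This is a minimal sketch under stated assumptions, not the paper's exact formulation: the weight names (Wu, Wv, Wz, Wo, gq, bq, gk, bk), the SiLU activation, and the 1/T normalization are illustrative choices.

import numpy as np

def silu(x):
    # SiLU/Swish activation, assumed here as the nonlinearity
    return x * (1.0 / (1.0 + np.exp(-x)))

def gated_attention_unit(x, Wu, Wv, Wz, Wo, gq, bq, gk, bk):
    # x: (T, d); Wu, Wv: (d, e); Wz: (d, s); Wo: (e, d); gq, bq, gk, bk: (s,)
    T = x.shape[0]
    u = silu(x @ Wu)                       # gating branch, shape (T, e)
    v = silu(x @ Wv)                       # value branch, shape (T, e)
    z = silu(x @ Wz)                       # shared low-dimensional projection, shape (T, s)
    q = z * gq + bq                        # query: per-dimension scale/offset of z
    k = z * gk + bk                        # key: a second scale/offset of the same z
    a = np.maximum(q @ k.T / T, 0.0) ** 2  # single-head relu^2 attention weights, shape (T, T)
    return (u * (a @ v)) @ Wo              # gate the attended values, project back to (T, d)

# Hypothetical usage with random weights, just to show the expected shapes.
T, d, e, s = 8, 16, 32, 4
rng = np.random.default_rng(0)
y = gated_attention_unit(rng.normal(size=(T, d)), rng.normal(size=(d, e)),
                         rng.normal(size=(d, e)), rng.normal(size=(d, s)),
                         rng.normal(size=(e, d)), np.ones(s), np.zeros(s),
                         np.ones(s), np.zeros(s))
assert y.shape == (T, d)

The gating is what lets a single, deliberately weak attention head suffice: the attention output is modulated elementwise by a learned gate, so less of the modeling burden falls on the attention weights themselves. The linear-approximation variant mentioned in the abstract would replace the quadratic (T, T) weight matrix with a chunked/linear mechanism, which is not shown here.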
Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a ...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
This article describes our experiments in neural machine translation using the recent Tensor2Tensor ...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexi...
The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitou...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
Transformer-based neural models are used in many AI applications. Training these models is expensive...
In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable, data driven framework for lea...
State space models (SSMs) have shown impressive results on tasks that require modeling long-range de...
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashi...
Transformer architecture has widespread applications, particularly in Natural Language Processing an...
We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by repla...
We evaluate three simple, normalization-centric changes to improve Transformer training. First, we s...
Transformer models have achieved state-of-the-art results across a diverse range of domains. However...