We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to...
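To make the token-mixing idea concrete, here is a minimal NumPy sketch of an FNet-style encoder block as described above: the self-attention sublayer is replaced by an unparameterized 2D discrete Fourier transform over the sequence and hidden dimensions (keeping only the real part), followed by the usual position-wise feed-forward sublayer. The helper names, shapes, and toy feed-forward sizes are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def fourier_mixing(x):
    """Token mixing without attention: 2D DFT over the sequence and hidden
    dimensions, keeping only the real part. No learned parameters."""
    return np.real(np.fft.fft2(x))  # x: (seq_len, hidden_dim)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_block(x, w1, b1, w2, b2):
    """One encoder block: Fourier mixing stands in for self-attention, then a
    position-wise ReLU feed-forward sublayer; both wrapped in residual
    connections and layer norm."""
    x = layer_norm(x + fourier_mixing(x))
    ff = np.maximum(x @ w1 + b1, 0.0) @ w2 + b2
    return layer_norm(x + ff)

# Toy usage (assumed sizes): 8 tokens, hidden size 16, feed-forward size 64.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
w1, b1 = 0.1 * rng.normal(size=(16, 64)), np.zeros(64)
w2, b2 = 0.1 * rng.normal(size=(64, 16)), np.zeros(16)
print(encoder_block(tokens, w1, b1, w2, b2).shape)  # (8, 16)
```

Because the mixing step is a fixed transform with no projections or attention weights, the block's only learned parameters sit in the feed-forward sublayer and the layer norms, which is where the reported speedups come from.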
The great success of transformer-based models in natural language processing (NLP) has led to variou...
Given a large Transformer model, how can we obtain a small and computationally efficient model which...
The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitou...
Transformer-based neural models are used in many AI applications. Training these models is expensive...
Transformer-based architectures are the model of choice for natural language understanding, but they...
The Transformer architecture has two main non-embedding components: Attention and the Feed Forward N...
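For reference, the two components named above can be sketched in a few lines of NumPy: single-head scaled dot-product attention and the position-wise feed-forward network. Weight names and the single-head simplification are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2
```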
We revisit the design choices in Transformers, and propose methods to address their weaknesses in ha...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
In this paper, we propose that the dot product pairwise matching attention layer, which is widely us...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
Recently, large-scale transformer-based models have been proven to be effective over various tasks a...
Characterizing neural networks in terms of better-understood formal systems has the potential to yie...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
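The abstract above is truncated, but the core idea behind Performers (the FAVOR+ mechanism) can be sketched: the softmax kernel exp(q·k / sqrt(d)) is approximated in expectation by an inner product of positive random features, which lets attention be computed in time linear rather than quadratic in sequence length. The NumPy sketch below is our illustrative rendering of that standard construction, not the paper's reference code; the names and the feature count are assumptions.

```python
import numpy as np

def positive_random_features(x, proj):
    """phi(x) = exp(x @ W - ||x||^2 / 2) / sqrt(m), with W ~ N(0, I).
    In expectation, phi(q) . phi(k) = exp(q . k)."""
    m = proj.shape[1]
    return np.exp(x @ proj - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

def favor_attention(q, k, v, proj):
    """Linear-time approximation of softmax attention via random features."""
    d = q.shape[-1]
    q_prime = positive_random_features(q / d**0.25, proj)  # absorbs the 1/sqrt(d) scaling
    k_prime = positive_random_features(k / d**0.25, proj)
    kv = k_prime.T @ v                          # (m, d_v): O(n m d), never forms the n x n matrix
    normalizer = q_prime @ k_prime.sum(axis=0)  # row sums of the implicit attention matrix
    return (q_prime @ kv) / normalizer[:, None]

# Toy usage (assumed sizes): 128 tokens, head size 32, 64 random features.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 128, 32))
proj = rng.normal(size=(32, 64))
print(favor_attention(q, k, v, proj).shape)  # (128, 32)
```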
With the success of Vision Transformers (ViTs) in computer vision tasks, recent works try to optimize...
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressiv...