Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length, require a lot of training data, and can be difficult to tune. In the pursuit of lower costs, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases required for natural language understanding. In this paper, we propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks. Empirically, we demonstrate that our model performs better than alternativ...
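To make the contrast with MLPMixer's static token mixing concrete, here is a minimal sketch of a hypernetwork-generated token-mixing layer in PyTorch. It is an illustrative simplification under our own naming (HyperTokenMixer, d_model, d_hidden are assumptions), not the paper's exact implementation; details such as normalization, tied hypernetworks, and positional information are omitted.

import torch
import torch.nn as nn

class HyperTokenMixer(nn.Module):
    """Illustrative sketch: token-mixing MLP whose weights are generated
    by hypernetworks from the token representations themselves, rather
    than being a fixed matrix over token positions as in MLPMixer."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Hypernetworks: each token embedding yields one row of W1 / W2.
        self.hyper_w1 = nn.Linear(d_model, d_hidden)
        self.hyper_w2 = nn.Linear(d_model, d_hidden)
        self.activation = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, d_model)
        w1 = self.hyper_w1(x)  # (batch, num_tokens, d_hidden)
        w2 = self.hyper_w2(x)  # (batch, num_tokens, d_hidden)
        # Mix information across the token dimension using the generated weights.
        hidden = self.activation(torch.einsum("bnd,bnh->bhd", x, w1))
        return torch.einsum("bhd,bnh->bnd", hidden, w2)

if __name__ == "__main__":
    layer = HyperTokenMixer(d_model=64, d_hidden=128)
    tokens = torch.randn(2, 10, 64)   # 2 sequences of 10 tokens each
    print(layer(tokens).shape)        # torch.Size([2, 10, 64])

Because the mixing weights are produced from the input, the layer handles variable-length sequences and content-dependent interactions, which is the inductive bias the abstract argues a static token-mixing MLP lacks.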
The current modus operandi in adapting pre-trained models involves updating all the backbone paramet...
The computation necessary for training Transformer-based language models has skyrocketed in recent y...
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexi...
We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by repla...
Prompt-Tuning is a new paradigm for finetuning pre-trained language models in a parameter-efficient ...
Transformer-based neural models are used in many AI applications. Training these models is expensive...
MLP-Mixer has recently emerged as a new challenger to CNNs and Transformers. Despite ...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
This document aims to be a self-contained, mathematically precise overview of transformer architectu...
We revisit the design choices in Transformers, and propose methods to address their weaknesses in ha...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
Limited computational budgets often prevent transformers from being used in production and from havi...
Transformers have shown great potential in computer vision tasks. A common belief is their attention...
Sparsely activated transformers, such as Mixture of Experts (MoE), have received great interest due ...
All-MLP architectures have attracted increasing interest as an alternative to attention-based models...