In this paper, we aim to build a global convergence theory for encoder-only shallow Transformers under a realistic setting with respect to architecture, initialization, and scaling in the finite-width regime. The difficulty lies in how to handle the softmax in the self-attention mechanism, the core ingredient of the Transformer. In particular, we diagnose the scaling scheme, carefully handle the input and output of the softmax, and prove that quadratic overparameterization is sufficient for the global convergence of our shallow Transformers under the He/LeCun initialization commonly used in practice. In addition, a neural tangent kernel (NTK) based analysis is provided, which facilitates a comprehensive comparison. Our theory demonstrates the separation on ...
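For concreteness, the following is a minimal sketch, under illustrative assumptions, of the kind of model described above: a single-layer, encoder-only Transformer with softmax self-attention and a He/LeCun-style variance-scaling initialization. The embedding dimension d, width m, temperature tau, and mean-pooled scalar head are placeholders of our own choosing, not the paper's exact architecture or scaling scheme.

    # Minimal sketch (not the authors' exact model) of an encoder-only,
    # single-layer Transformer with softmax self-attention, initialized
    # with He/LeCun-style variance scaling. All names are illustrative.
    import numpy as np

    def he_init(fan_in, fan_out, rng):
        # He-style initialization: variance proportional to 1 / fan_in.
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def shallow_transformer(X, params, tau):
        # X: (seq_len, d) token embeddings of one input sequence.
        WQ, WK, WV, W1, W2, v = params
        Q, K, V = X @ WQ, X @ WK, X @ WV
        A = softmax(Q @ K.T / tau, axis=-1)       # softmax self-attention, (seq_len, seq_len)
        H = A @ V                                 # attention output, (seq_len, d)
        F = np.maximum(H @ W1, 0.0) @ W2          # two-layer ReLU feed-forward block, (seq_len, d)
        return float(F.mean(axis=0) @ v)          # mean-pool tokens, scalar prediction

    rng = np.random.default_rng(0)
    d, m = 16, 256                                # embedding dimension, hidden width
    params = (he_init(d, d, rng), he_init(d, d, rng), he_init(d, d, rng),
              he_init(d, m, rng), he_init(m, d, rng),
              rng.normal(0.0, np.sqrt(1.0 / d), size=(d,)))  # LeCun-style output layer
    X = rng.normal(size=(10, d))                  # a toy length-10 input sequence
    print(shallow_transformer(X, params, tau=np.sqrt(d)))

Here tau = sqrt(d) mimics the usual attention scaling; the paper's point is precisely that the choice of such scaling and initialization determines whether global convergence can be guaranteed at finite width.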
We evaluate three simple, normalization-centric changes to improve Transformer training. First, we s...
Understanding the fundamental mechanism behind the success of transformer networks is still an open ...
Transformers have achieved remarkable success in several domains, ranging from natural language proc...
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approa...
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-at...
The general trend in NLP is towards increasing model capacity and performance via deeper neural netw...
Several recent works demonstrate that transformers can implement algorithms like gradient descent. B...
In this work, we study rapid, step-wise improvements of the loss in transformers when being confront...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable, data driven framework for lea...
The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitou...
Characterizing neural networks in terms of better-understood formal systems has the potential to yie...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
Solid results from Transformers have made them prevailing architectures in various natural language ...