In this work, we study rapid, step-wise improvements of the loss that transformers exhibit when confronted with multi-step decision tasks. We find that transformers struggle to learn the intermediate task, whereas CNNs show no such difficulty on the tasks we studied. When transformers do learn the intermediate task, they do so rapidly and unexpectedly, after both the training and the validation loss have been saturated for hundreds of epochs. We call these rapid improvements Eureka-moments, since the transformer appears to suddenly learn a previously incomprehensible task. Similar leaps in performance have become known as Grokking. In contrast to Grokking, for Eureka-moments both the training and the validation loss saturate before rapidly improving. We trace the pr...
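To make the distinction operational, the two phenomena can be separated by where the delayed, step-wise drop appears: in a Eureka-moment both the training and the validation loss plateau and then fall together, whereas in Grokking the training loss has long since converged and only the validation loss shows the delayed drop. The snippet below is a minimal illustrative sketch of this criterion (it is not the paper's method); the function names and all thresholds (plateau_len, plateau_tol, drop_frac, drop_window) are assumptions chosen for illustration, and the loss curves are assumed to be logged once per epoch.

```python
import numpy as np


def find_sudden_drop(loss, plateau_len=100, plateau_tol=0.02,
                     drop_frac=0.5, drop_window=10):
    """Return the first epoch at which `loss` drops by more than `drop_frac`
    (relative) within `drop_window` epochs, after having stayed within
    `plateau_tol` (relative spread) for at least `plateau_len` epochs.
    Returns None if no such step-wise drop exists. All thresholds are
    illustrative assumptions, not values from the paper."""
    loss = np.asarray(loss, dtype=float)
    for t in range(plateau_len, len(loss) - drop_window):
        window = loss[t - plateau_len:t]
        plateau = (window.max() - window.min()) / window.mean() < plateau_tol
        drop = loss[t:t + drop_window].min() < (1.0 - drop_frac) * window.mean()
        if plateau and drop:
            return t
    return None


def classify_transition(train_loss, val_loss, **kw):
    """Label a run: 'eureka' if both the training and the validation loss
    plateau and then drop, 'grokking' if only the validation loss shows the
    delayed drop, 'none' otherwise."""
    t_train = find_sudden_drop(train_loss, **kw)
    t_val = find_sudden_drop(val_loss, **kw)
    if t_train is not None and t_val is not None:
        return "eureka", (t_train, t_val)
    if t_val is not None:
        return "grokking", (t_train, t_val)
    return "none", (t_train, t_val)
```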