Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: Can transformers learn to implement such algorithms by training over random problem instances? To our knowledge, we make the first theoretical progress on this question via an analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, we prove the global minimum of the training objective implements a single iteration of preconditioned gradient descent. Notably, ...
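A minimal NumPy sketch of the construction described in the abstract above, assuming the standard in-context linear-regression setup: with suitably chosen weights, a single linear self-attention layer produces the same prediction as one step of preconditioned gradient descent from a zero initialization. The specific preconditioner `A` and the weight construction below are illustrative assumptions, not the authors' exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20                      # feature dimension, number of in-context examples

# Random linear-regression instance: y_i = w*^T x_i.
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))       # in-context inputs
y = X @ w_star                    # in-context labels
x_query = rng.normal(size=d)      # query input whose label the model must predict

# An arbitrary (diagonal) preconditioner; the step size is folded into A. Assumed for illustration.
A = np.diag(1.0 / rng.uniform(1.0, 2.0, size=d))

# One step of preconditioned gradient descent on the in-context least-squares loss
#   L(w) = (1/2) * sum_i (w^T x_i - y_i)^2,  starting from w = 0.
grad_at_zero = -X.T @ y           # gradient of L at w = 0
w_one_step = -A @ grad_at_zero    # w <- 0 - A * grad  =  A X^T y
pred_gd = x_query @ w_one_step

# The same prediction from a single linear attention layer: tokens are z_i = (x_i, y_i),
# the query token is (x_query, 0). With the value projection extracting y_i * x_i and the
# key/query product implementing x_query^T A x_i, the attention output on the query token is
#   sum_i (x_query^T A x_i) * y_i  =  x_query^T A (X^T y),
# which coincides with the one-step preconditioned GD prediction above.
pred_attn = sum((x_query @ A @ X[i]) * y[i] for i in range(n))

print(np.isclose(pred_gd, pred_attn))   # True: the two computations agree
```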
Deep learning models such as the Transformer are often constructed by heuristics and experience. To ...
In this work, we study rapid, step-wise improvements of the loss in transformers when being confront...
Despite the widespread success of Transformers on NLP tasks, recent works have found that they strug...
Transformer neural networks can exhibit a surprising capacity for in-context learning (ICL), despite...
Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-...
Transformer networks have seen great success in natural language processing and machine vision, wher...
In this paper, we aim to build the global convergence theory of encoder-only shallow Transformers un...
Transformers have become an important workhorse of machine learning, with numerous applications. Thi...
Mixture models arise in many regression problems, but most methods have seen limited adoption partly...
Transformers have achieved remarkable success in several domains, ranging from natural language proc...
Transformer-based language models exhibit intelligent behaviors such as understanding natural langua...
Understanding the fundamental mechanism behind the success of transformer networks is still an open ...
There has been an explosion of interest in designing high-performance Transformers. While Transforme...
In this work, we introduce KERNELIZED TRANSFORMER, a generic, scalable, data-driven framework for lea...