Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
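The rank-collapse claim is straightforward to probe numerically. Below is a minimal NumPy sketch, not code from this work: it stacks single-head softmax self-attention layers with no skip connections or MLPs and prints the relative Frobenius distance of the output from the nearest rank-1 matrix with identical rows. The token count, width, depth, and random Gaussian weights are illustrative assumptions; under this setup the residual typically shrinks rapidly with depth, consistent with the stated inductive bias towards token uniformity.

```python
# Minimal sketch (assumed setup, not the authors' code): pure softmax self-attention
# with no skip connections or MLPs, tracking how close the output is to a rank-1
# matrix with identical rows.
import numpy as np

rng = np.random.default_rng(0)
n, d, depth = 32, 16, 12  # tokens, embedding width, number of layers (illustrative)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(X, Wq, Wk, Wv):
    """Single-head softmax self-attention, no skip connection, no MLP."""
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d), axis=-1)
    return A @ (X @ Wv)

def rank1_residual(X):
    """Relative Frobenius distance from X to the closest matrix with identical rows."""
    R = X - X.mean(axis=0, keepdims=True)
    return np.linalg.norm(R) / np.linalg.norm(X)

X = rng.standard_normal((n, d))
for layer in range(depth):
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    X = attention_layer(X, Wq, Wk, Wv)
    print(f"layer {layer + 1:2d}: relative rank-1 residual = {rank1_residual(X):.3e}")
```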
We take a deep look into the behaviour of self-attention heads in the transformer architecture. In l...
Category learning performance is influenced by both the nature of the category's structure and the w...
Transformer networks are able to capture patterns in data coming from many domains (text, images, vi...
Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even su...
Recent years have seen the vast potential of the Transformer model, as it is arguably the first gene...
Self-attention, an architectural motif designed to model long-range interactions in sequential data,...
The training and generalization dynamics of the Transformer's core mechanism, namely the Attention m...
Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider t...
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It c...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corrupt...
To improve the robustness of transformer neural networks used for temporal-dynamics prediction of ch...
To overcome the quadratic cost of self-attention, recent works have proposed various sparse attentio...