Algorithmic generalization in machine learning refers to the ability to learn the underlying algorithm that generates the data, in a way that generalizes out-of-distribution. This is generally considered a difficult task for most machine learning algorithms. Here, we analyze algorithmic generalization when counting is required, either implicitly or explicitly. We show that standard Transformers rely on architectural decisions that hinder out-of-distribution performance on such tasks. In particular, we discuss the consequences of using layer normalization and of normalizing the attention weights via softmax. By ablating the problematic operations, we demonstrate that a modified Transformer can exhibit good algorithmic generalization ...
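The intuition behind the softmax point can be made concrete with a small sketch. The snippet below is a minimal, illustrative example under assumptions of our own (PyTorch, a sigmoid-gated score as a stand-in for whichever unnormalized variant the ablation actually uses; the names softmax_attention and unnormalized_attention are hypothetical), not the paper's implementation: because softmax weights sum to 1, attention computes a weighted average of the values, so its output cannot directly encode how many tokens matched the query, whereas an unnormalized variant can.

```
# Minimal sketch (illustrative, not the authors' model): softmax attention
# averages over matching tokens, an unnormalized variant sums over them.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: weights sum to 1, so the output
    # is a weighted *average* of the values and loses count information.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def unnormalized_attention(q, k, v):
    # Ablated variant (assumed sigmoid gating): raw per-token weights, so the
    # output is a *sum* over matching tokens and grows with their number.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.sigmoid(scores) @ v

# Toy check: a query that matches n identical keys, each contributing a unit value.
d = 4
token = torch.randn(1, d)
q = token.clone()
for n in (2, 8):
    k = token.repeat(n, 1)
    v = torch.ones(n, d)
    print(n,
          softmax_attention(q, k, v).norm().item(),
          unnormalized_attention(q, k, v).norm().item())
```

In this toy setting the softmax output stays essentially constant as n grows, while the unnormalized output scales with n, which is exactly the signal a counting task needs. A similar argument applies to layer normalization, which rescales activations and thereby discards magnitude information that could otherwise represent a count.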