Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence ...
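To make the label-based scheme concrete, the sketch below shows one way such an encoding could look, assuming PyTorch; the class name RandomLabelPositionalEncoding, the max_label range, and the sample-then-sort step are illustrative assumptions rather than the paper's exact construction.

```python
import torch
import torch.nn as nn

class RandomLabelPositionalEncoding(nn.Module):
    """Illustrative label-based positional encoding (an assumption, not the
    paper's published code): each item in a sequence is tagged with a random
    label drawn from a range much larger than any training sequence length."""

    def __init__(self, d_model: int, max_label: int = 1024):
        super().__init__()
        self.max_label = max_label
        self.label_embedding = nn.Embedding(max_label, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings.
        batch, seq_len, _ = x.shape
        # Sample distinct labels for each sequence, then sort them so the
        # labels carry ordering information without encoding absolute position.
        labels = torch.stack([
            torch.randperm(self.max_label, device=x.device)[:seq_len].sort().values
            for _ in range(batch)
        ])
        # Add the label embeddings where standard positional encodings would go.
        return x + self.label_embedding(labels)
```

Because longer test sequences still draw their labels from the same range used in training, the model never has to extrapolate to unseen position indices, which is the property the abstract credits for generalization beyond training lengths.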
The deep learning architecture associated with ChatGPT and related generative AI products is known as ...
Self-supervised training methods for transformers have demonstrated remarkable performance across various ...
The transformer is a neural network component that can be used to learn useful representations of sequences ...
Despite progress across a broad range of applications, Transformers have limited success in systematic generalization ...
Can transformers generalize efficiently on problems that require dealing with examples with different ...
Transformer-based language models exhibit intelligent behaviors such as understanding natural language ...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable ...
Algorithmic generalization in machine learning refers to the ability to learn the underlying algorithm ...
When trained on language data, do transformers learn some arbitrary computation that utilizes the full ...
Pretrained transformer models have demonstrated remarkable performance across various natural language ...
This document aims to be a self-contained, mathematically precise overview of transformer architectures ...
Structure prediction (SP) tasks are important in natural language understanding in the sense that they ...
Several recent works demonstrate that transformers can implement algorithms like gradient descent. ...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention ...
Humans can systematically generalize to novel compositions of existing concepts. Recent studies argue ...