Transformer-based language models exhibit intelligent behaviors such as understanding natural language, recognizing patterns, acquiring knowledge, reasoning, planning, reflecting, and using tools. This paper explores how their underlying mechanics give rise to these behaviors. Toward that end, we propose framing Transformer dynamics as movement through embedding space. Examining Transformers through this perspective reveals key insights, establishing a Theory of Transformers: 1) Intelligent behaviors map to paths in embedding space, which the Transformer random-walks through during inference. 2) LM training learns a probability distribution over all possible paths. 'Intelligence' is learned by assigning higher probabilities to paths...
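A minimal sketch of the "generation as a path through embedding space" framing described above, not the paper's implementation: a toy model whose stochastic decoding step is read as one step of a random walk over token embeddings. All names, sizes, and the stand-in transition matrix are illustrative assumptions.

```python
# Hypothetical toy illustration: a generated sequence corresponds to a path of
# points in a shared embedding space, and sampled decoding is a random walk.
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 8, 4
E = rng.normal(size=(vocab_size, dim))   # token embedding matrix (assumed)
W = rng.normal(size=(dim, dim))          # stand-in for the Transformer stack

def step_distribution(x):
    """Map the current point in embedding space to a distribution over next tokens."""
    logits = E @ (W @ x)                 # score each vocabulary embedding
    logits -= logits.max()               # numerical stability before softmax
    p = np.exp(logits)
    return p / p.sum()

# Sample one path: each generated token moves the walk to that token's embedding.
token = rng.integers(vocab_size)
path = [E[token]]
for _ in range(5):
    p = step_distribution(path[-1])
    token = rng.choice(vocab_size, p=p)  # stochastic decoding = one random-walk step
    path.append(E[token])

print("path through embedding space:\n", np.stack(path).round(2))
```

Under this framing, training would shift probability mass onto "useful" paths; the sketch only shows how a decoded sequence traces a path, not how that distribution is learned.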
Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if t...
In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of e...
We seek to understand how the representations of individual tokens and the structure of the learned ...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
Transformer networks have seen great success in natural language processing and machine vision, wher...
Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. W...
Language Generation Models produce words based on the previous context. Although existing methods of...
The transformer is a neural network component that can be used to learn useful representations of se...
When trained on language data, do transformers learn some arbitrary computation that utilizes the fu...
The deep learning architecture associated with ChatGPT and related generative AI products is known a...
The transformer architecture and variants presented remarkable success across many machine learning ...
This document aims to be a self-contained, mathematically precise overview of transformer architectu...
We show how to "compile" human-readable programs into standard decoder-only transformer models. Our ...
Pretrained transformer-based language models achieve state-of-the-art performance in many NLP tasks,...
We analyze the Knowledge Neurons framework for the attribution of factual and relational knowledge t...