For a language model (LM) to faithfully model human language, it must compress vast, potentially infinite information into relatively few dimensions. We propose analyzing compression in (pre-trained) LMs from two points of view: geometric and information-theoretic. We demonstrate that the two views are highly correlated, such that the intrinsic geometric dimension of linguistic data predicts their coding length under the LM. We then show that, in turn, high compression of a linguistic dataset predicts rapid adaptation to that dataset, confirming that being able to compress linguistic information is an important part of successful LM performance. As a practical byproduct of our analysis, we evaluate a battery of intrinsic dimension...
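To make the two notions of compression concrete, the sketch below (Python, assuming NumPy and SciPy are available) computes the two quantities the abstract relates: the Shannon coding length of a dataset under an LM, derived from the per-token log-probabilities the model assigns, and a geometric intrinsic dimension estimate of the corresponding representations using the TwoNN estimator of Facco et al. (2017), one common member of the family of estimators alluded to above. The function names, the use of TwoNN, and the assumption that log-probabilities are natural logs are illustrative choices, not necessarily the paper's exact pipeline.

    import numpy as np
    from scipy.spatial import cKDTree

    def coding_length_bits(token_logprobs):
        """Shannon coding length (in bits) of a dataset under a language model:
        the summed negative log-probability of each token, converted from nats
        to bits. `token_logprobs` is assumed to hold natural-log probabilities."""
        return -np.sum(token_logprobs) / np.log(2.0)

    def twonn_intrinsic_dimension(X):
        """TwoNN intrinsic dimension estimate (Facco et al., 2017) for an
        (n_points, n_features) array of representations, e.g., LM hidden states.
        Uses the ratio of each point's second- to first-nearest-neighbor distance."""
        tree = cKDTree(X)
        # k=3 returns the point itself plus its two nearest neighbors
        dists, _ = tree.query(X, k=3)
        mu = dists[:, 2] / dists[:, 1]
        mu = mu[np.isfinite(mu) & (mu > 1.0)]  # guard against duplicate points
        # maximum-likelihood estimate: d = N / sum(log mu_i)
        return len(mu) / np.sum(np.log(mu))

Under the abstract's central claim, a dataset whose representations yield a lower intrinsic dimension estimate should also receive a shorter coding length from the model, and, in turn, should be faster to adapt to.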