Recent large language models have been trained on vast datasets, but often also on repeated data, either intentionally, to upweight higher-quality sources, or unintentionally, because data deduplication is imperfect and the model sees repeats at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can cause test loss to increase midway t...
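Concretely, the setup described in this abstract amounts to building a training stream in which a small pool of documents recurs many times among otherwise unique data. Below is a minimal Python sketch of one way to construct such a mixture; the function name build_repeated_mixture and its parameters are hypothetical illustrations rather than the authors' actual pipeline, and it assumes roughly uniform document sizes.

```python
import random

def build_repeated_mixture(unique_docs, repeated_fraction=0.1,
                           repeats=100, seed=0):
    """Build a training stream where a small set of documents recurs
    many times among otherwise unique data.

    unique_docs: list of distinct documents.
    repeated_fraction: approximate share of the final stream occupied
        by the repeated subset (assumes uniform document sizes).
    repeats: how many times each repeated document occurs.
    """
    rng = random.Random(seed)
    total = len(unique_docs)
    # Choose how many distinct documents go in the repeated pool so
    # that their copies fill roughly `repeated_fraction` of the stream.
    n_repeated = max(1, int(total * repeated_fraction / repeats))
    repeated_pool = unique_docs[:n_repeated]
    singles = unique_docs[n_repeated:]
    # Unique documents appear once; pool documents appear `repeats` times.
    stream = list(singles) + repeated_pool * repeats
    rng.shuffle(stream)
    return stream
```

For example, with 1,000,000 documents, repeated_fraction=0.1, and repeats=100, the pool holds 1,000 distinct documents whose 100,000 copies make up roughly a tenth of the shuffled stream; the repetition level is then controlled by a single knob (repeats), which is what makes sweeps over such families of models tractable.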
This is a gzipped CSV file containing the 13 million Duolingo student learning traces used in experi...
Statistical learning is a robust mechanism of the brain that enables the extraction of environmental...
Machine learning (ML), a computational self-learning platform, is expected to be applied in a variet...
Despite the huge progress in myriad generation tasks, pretrained language models (LMs) such as GPT2 ...
Despite their wide adoption, the underlying training and memorization dynamics of very large languag...
When a word is read more than once, reading time generally decreases in the successive occurrences. ...
Text generation tasks, including translation, summarization, language modeling, etc., see rapid gro...
While large-scale neural language models, such as GPT2 and BART, have achieved impressive results on...
A central question in natural language understanding (NLU) research is whether high performance demo...
Written text is one of the fundamental manifestations of human language, and the study of its univer...
Developing approaches to improve motor skill learning is of considerable interest across multiple di...
In a single large-scale study, we demonstrate that verbal sequence learning as studied using the cl...
Neural scaling laws define a predictable relationship between a model's parameter count and its perf...
Language models, given their black-box nature, often exhibit sensitivity to input perturbations, lea...