Neural scaling laws define a predictable relationship between a model's parameter count and its performance after training in the form of a power law. However, most research to date has not explicitly investigated whether scaling laws can be used to accelerate model development. In this work, we perform such an empirical investigation, training models with as few as 10K parameters and evaluating downstream performance across 9 language understanding tasks. We find that scaling laws emerge at finetuning time in some NLP tasks, and that they can also be exploited for debugging convergence when training large models. Moreover, for tasks where scaling laws exist, they can be used to predict...
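As a rough illustration of how such a power law can be exploited for prediction, the sketch below fits a saturating power law, err(N) = a * N^(-b) + c, to downstream error measured on a handful of small models and extrapolates it to a larger parameter count. The functional form, the data points, and the SciPy-based fitting routine are illustrative assumptions, not the exact procedure used in this work.

```python
# Minimal sketch (assumed form and data, not this paper's exact procedure):
# fit a saturating power law to small-model results and extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, b, c):
    """Downstream error decays as a * N^(-b) toward an irreducible floor c."""
    return a * n_params ** (-b) + c

# Hypothetical (parameter count, downstream error) pairs from small finetuned models.
n = np.array([1e4, 1e5, 1e6, 1e7, 1e8])
err = np.array([0.62, 0.48, 0.37, 0.29, 0.24])

# Bounds keep the exponent and the error floor in a physically sensible range.
(a, b, c), _ = curve_fit(power_law, n, err,
                         p0=[1.0, 0.1, 0.1],
                         bounds=([0, 0, 0], [np.inf, 2, 1]))

print(f"fit: a={a:.3f}, b={b:.3f}, c={c:.3f}")
# Extrapolate: predicted downstream error for a 1B-parameter model before training it.
print(f"predicted error at 1e9 params: {power_law(1e9, a, b, c):.3f}")
```

If the fitted curve tracks the small-model points closely, the same extrapolation can also flag convergence problems: a large model landing far above the predicted curve suggests a training issue rather than a genuine scaling limit.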
Scaling laws are ubiquitous in nature, and they pervade neural, behavioral and linguistic activities...
We study trends in model size of notable machine learning systems over time using a curated dataset....
We study the role of an essential hyperparameter that governs the training of Transformers for neura...
Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharm...
It took until the last decade to finally see a machine match human performance on essentially any ta...
Running faster will only get you so far — it is generally advisable to first understand where the ro...
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexi...
We study the compute-optimal trade-off between model and training data set sizes for large neural ne...
Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of...
Thesis (Ph.D.), University of Washington, 2023. Language models (LMs) are at the core of almost all st...
Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a ...
Transformer-based masked language models trained on general corpora, such as BERT and RoBERTa, have ...
This article describes our experiments in neural machine translation using the recent Tensor2Tensor ...
As language models scale up, it becomes increasingly expensive to verify research ideas because conc...