Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular dense models is reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large-scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language ...
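To make the upcycling step concrete, the sketch below shows one minimal way to initialize a Mixture-of-Experts layer from a dense checkpoint: every expert starts as a copy of the pretrained dense feed-forward block, while the router, which has no dense counterpart, is initialized from scratch. This is an illustrative PyTorch example with hypothetical class and function names; the paper's actual implementation uses T5X/JAX, and details such as routing and gating may differ.

```python
# Minimal sparse-upcycling sketch for a single Transformer MLP block.
# Hypothetical names (DenseMLP, UpcycledMoE); not the paper's T5X/JAX code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseMLP(nn.Module):
    """A standard Transformer feed-forward block (stands in for the dense checkpoint)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff)
        self.wo = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.wo(F.relu(self.wi(x)))


class UpcycledMoE(nn.Module):
    """MoE layer whose experts are all initialized from one pretrained dense MLP."""

    def __init__(self, dense_mlp: DenseMLP, num_experts: int, d_model: int):
        super().__init__()
        # Upcycling: each expert is an exact copy of the dense block's weights.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_mlp) for _ in range(num_experts)]
        )
        # The router has no dense counterpart, so it is trained from scratch.
        self.router = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Top-1 routing, for simplicity.
        gate_probs = self.router(x).softmax(dim=-1)      # (tokens, experts)
        top_prob, top_idx = gate_probs.max(dim=-1)       # (tokens,)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Each selected expert's output is scaled by its gate probability.
                out[mask] = top_prob[mask].unsqueeze(1) * expert(x[mask])
        return out


# Usage: "upcycle" a (pretrained) dense block into an 8-expert MoE block.
dense = DenseMLP(d_model=512, d_ff=2048)
moe = UpcycledMoE(dense, num_experts=8, d_model=512)
tokens = torch.randn(16, 512)
print(moe(tokens).shape)  # torch.Size([16, 512])
```

Because all experts start from the same pretrained weights, the upcycled layer initially behaves much like the dense layer it replaces; the experts then diverge during continued (sparse) training as the router learns to specialize them.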