Large Language Models have become the core architecture upon which most modern natural language processing (NLP) systems are built. These models consistently deliver impressive accuracy and robustness across tasks and domains, but their high computational overhead makes inference difficult and expensive. To reduce these costs, recent work has explored structured and unstructured pruning, quantization, and distillation as ways to improve inference speed and decrease model size. This paper studies how models pruned with Gradual Unstructured Magnitude Pruning transfer between domains and tasks. Our experiments show that models pruned during pretraining using general-domain masked language model...
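For readers unfamiliar with the technique, the sketch below illustrates gradual unstructured magnitude pruning in PyTorch: the smallest-magnitude weights are zeroed out on a sparsity schedule that ramps up over the course of training, with the resulting masks re-applied between pruning events. The cubic schedule (following Zhu & Gupta, 2017), the helper names, and the PyTorch framing are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch

def sparsity_at_step(step, total_steps, final_sparsity, initial_sparsity=0.0):
    # Cubic sparsity schedule (assumption, following Zhu & Gupta, 2017):
    # sparsity ramps from initial_sparsity to final_sparsity over total_steps.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def magnitude_prune_(weight, sparsity):
    # Zero out the smallest-magnitude entries of `weight` in place so that
    # roughly `sparsity` of its entries are zero; return the binary mask.
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    weight.mul_(mask)
    return mask

def gradual_magnitude_pruning_step(model, step, total_steps, final_sparsity,
                                   masks, prune_every=100):
    # Hypothetical training hook: every `prune_every` steps, recompute the
    # target sparsity and re-prune each Linear layer by weight magnitude;
    # otherwise, keep previously pruned weights at zero.
    if step % prune_every == 0:
        target = sparsity_at_step(step, total_steps, final_sparsity)
        for name, module in model.named_modules():
            if isinstance(module, torch.nn.Linear):
                masks[name] = magnitude_prune_(module.weight.data, target)
    else:
        for name, module in model.named_modules():
            if isinstance(module, torch.nn.Linear) and name in masks:
                module.weight.data.mul_(masks[name])
    return masks
```

In a pretraining loop, `gradual_magnitude_pruning_step` would be called after each optimizer update with a shared `masks` dictionary, so the network is trained and pruned jointly rather than pruned once after convergence.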