Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time. The student models are typically compact transformers with fewer parameters, while expensive operations such as self-attention persist. Therefore, the improved inference speed may still be unsatisfactory for real-time or high-volume use cases. In this paper, we aim to further push the limit of inference speed by distilling teacher models into bigger, sparser student models -- bigger in that they scale up to billions of parameters; sparser in that most of the model parameters are n-gram embeddings. Our experiments on six single-sentence text classification tasks show that these student models retain...
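To make the described student architecture concrete, the following is a minimal, hypothetical sketch under stated assumptions: nearly all parameters sit in an n-gram embedding table, each input is reduced to a bag of hashed n-gram ids, and the student is trained against the teacher's soft labels with a standard distillation loss. The names (NGramBagStudent, hash_ngrams, distillation_loss), the EmbeddingBag pooling, and the KL-based objective are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hash_ngrams(tokens, n, vocab_size):
    # Map each n-gram to an embedding row via hashing (assumed preprocessing step).
    # A stable hash (e.g. hashlib) would be used in practice; built-in hash() is
    # salted per process and shown here only for brevity.
    return [hash(" ".join(tokens[i:i + n])) % vocab_size
            for i in range(len(tokens) - n + 1)]

class NGramBagStudent(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        # The embedding table dominates the parameter count and can be scaled up
        # aggressively; inference only touches the rows of the n-grams actually
        # present in the input, which is what keeps it fast despite the size.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, ngram_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        pooled = self.embedding(ngram_ids, offsets)  # average the n-gram embeddings
        return self.classifier(pooled)               # class logits

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # Soft-label distillation: KL divergence between temperature-scaled distributions.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# Toy usage (tiny sizes, just to show the shapes):
student = NGramBagStudent(vocab_size=50_000, embed_dim=64, num_classes=2)
ids = torch.tensor(hash_ngrams("a quick brown fox".split(), n=2, vocab_size=50_000))
offsets = torch.tensor([0])  # one example in the batch
logits = student(ids, offsets)
```

In a setup like this, the per-example cost is essentially an embedding lookup plus one linear layer, which is why the parameter count can grow to billions while inference remains far cheaper than running self-attention.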
The human education system trains one student with multiple experts. Mixture-of-experts (MoE) is a powerfu...
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many pri...
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse ph...
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural La...
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fin...
Scaling language models with more data, compute and parameters has driven significant progress in na...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
With the dramatically increased number of parameters in language models, sparsity methods have recei...
Large pretrained language models have achieved state-of-the-art results on a variety of downstream t...
Transformer-based neural models are used in many AI applications. Training these models is expensive...
Overparameterized neural networks generalize well but are expensive to train. Ideally, one would lik...
Recent trends in language modeling have focused on increasing performance through scaling, and have ...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
Transformer-based masked language models trained on general corpora, such as BERT and RoBERTa, have ...
Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DN...