Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time. The student models are typically compact transformers with fewer parameters, while expensive operations such as self-attention persist. Therefore, the improved inference speed may still be unsatisfactory for real-time or high-volume use cases. In this paper, we aim to further push the limit of inference speed by distilling teacher models into bigger, sparser student models -- bigger in that they scale up to billions of parameters; sparser in that most of the model parameters are n-gram embeddings. Our experiments on six single-sentence text classification tasks show that these student models retain...
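To make the described student architecture concrete, the following is a minimal, hypothetical sketch under stated assumptions: nearly all parameters sit in an n-gram embedding table, each input is reduced to a bag of hashed n-gram ids, and the student is trained against the teacher's soft labels with a standard distillation loss. The names (NGramBagStudent, hash_ngrams, distillation_loss), the EmbeddingBag pooling, and the KL-based objective are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hash_ngrams(tokens, n, vocab_size):
    # Map each n-gram to an embedding row via hashing (assumed preprocessing step).
    # A stable hash (e.g. hashlib) would be used in practice; built-in hash() is
    # salted per process and shown here only for brevity.
    return [hash(" ".join(tokens[i:i + n])) % vocab_size
            for i in range(len(tokens) - n + 1)]

class NGramBagStudent(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        # The embedding table dominates the parameter count and can be scaled up
        # aggressively; inference only touches the rows of the n-grams actually
        # present in the input, which is what keeps it fast despite the size.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, ngram_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        pooled = self.embedding(ngram_ids, offsets)  # average the n-gram embeddings
        return self.classifier(pooled)               # class logits

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # Soft-label distillation: KL divergence between temperature-scaled distributions.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# Toy usage (tiny sizes, just to show the shapes):
student = NGramBagStudent(vocab_size=50_000, embed_dim=64, num_classes=2)
ids = torch.tensor(hash_ngrams("a quick brown fox".split(), n=2, vocab_size=50_000))
offsets = torch.tensor([0])  # one example in the batch
logits = student(ids, offsets)
```

In a setup like this, the per-example cost is essentially an embedding lookup plus one linear layer, which is why the parameter count can grow to billions while inference remains far cheaper than running self-attention.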
The human education system trains one student with multiple experts. Mixture-of-experts (MoE) is a powerfu...
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many pri...
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse ph...
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural La...
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fin...
Scaling language models with more data, compute and parameters has driven significant progress in na...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
With the dramatically increased number of parameters in language models, sparsity methods have recei...
Large pretrained language models have achieved state-of-the-art results on a variety of downstream t...
Transformer-based neural models are used in many AI applications. Training these models is expensive...
Overparameterized neural networks generalize well but are expensive to train. Ideally, one would lik...
Recent trends in language modeling have focused on increasing performance through scaling, and have ...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
Transformer-based masked language models trained on general corpora, such as BERT and RoBERTa, have ...
Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DN...