Although pre-trained language models such as BERT have achieved appealing performance on a wide range of Natural Language Processing (NLP) tasks, they are computationally expensive to deploy in real-time applications. A typical remedy is knowledge distillation, which compresses these large pre-trained models (teacher models) into small student models. However, for a target domain with scarce training data, the teacher can hardly pass useful knowledge to the student, which degrades the student models' performance. To tackle this problem, we propose a method to learn to augment data for BERT Knowledge Distillation in target domains with scarce labeled data, by learning a cross-domain manipulation scheme that automatically au...
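The compression step referenced above is the standard teacher-student setup: the small model is trained to match the large model's softened output distribution in addition to the gold labels. The sketch below is a generic distillation objective in PyTorch, not the paper's cross-domain augmentation scheme; the function name distillation_loss and the temperature / alpha hyperparameters are illustrative assumptions rather than values from any of the cited works.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between the temperature-scaled
    # teacher distribution and the student's log-probabilities.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_term = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Hard-target term: ordinary cross-entropy on the gold labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term

# Minimal usage example with random logits for a 3-class task.
if __name__ == "__main__":
    student_logits = torch.randn(8, 3)
    teacher_logits = torch.randn(8, 3)
    labels = torch.randint(0, 3, (8,))
    print(distillation_loss(student_logits, teacher_logits, labels))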
Neural machine translation (NMT) systems have greatly improved the quality available from machine tr...
The use of superior algorithms and complex architectures in language models has successfully impart...
Deep and large pre-trained language models (e.g., BERT, GPT-3) are state-of-the-art for various natu...
Recent advances with large-scale pre-trained language models (e.g., BERT) have brought significant p...
In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain compact BERT-style langu...
Large-scale pretrained language models have led to significant improvements in Natural Language Proc...
In the natural language processing (NLP) literature, neural networks are becoming increasingly deepe...
Knowledge distillation is typically conducted by training a small model (the student) to mimic a lar...
In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage re...
Pre-trained models such as BERT have achieved great results in various natural language process...
Building a Neural Language Model from scratch involves a large number of different design decisions. Y...
Pre-trained language representation models, such as BERT, capture a general language representation ...