Knowledge distillation (KD) has been extensively employed to transfer knowledge from a large teacher model to smaller students, where the parameters of the teacher are fixed (or partially fixed) during training. Recent studies show that this mode may cause difficulties in knowledge transfer due to mismatched model capacities. To alleviate the mismatch problem, teacher-student joint training methods, e.g., online distillation, have been proposed, but they always incur expensive computational costs. In this paper, we present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, which achieves efficient and sufficient knowledge transfer by updating relatively few partial parameters. Technically, we first m...
Distilling knowledge from a large teacher model to a lightweight one is a widely successful approach...
Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teach...
How to train an ideal teacher for knowledge distillation is still an open problem. It has been widel...
Knowledge Distillation (KD) transfers the knowledge from a high-capacity teacher network to strength...
Distillation is an effective knowledge-transfer technique that uses predicted distributions of a pow...
Most existing distillation methods ignore the flexible role of the temperature in the loss function ...
Unlike existing knowledge distillation methods that focus on the baseline settings, where the teacher mod...
Knowledge distillation is typically conducted by training a small model (the student) to mimic a lar...
In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage re...
Knowledge distillation (KD) has shown very promising capabilities in transferring learning represent...
Knowledge distillation is a simple yet effective technique for deep model compression, which aims to...
Data-free knowledge distillation (DFKD) is a widely-used strategy for Knowledge Distillation (KD) wh...
Knowledge distillation aims to transfer useful information from a teacher network to a student netwo...
In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain compact BERT-style langu...
Knowledge distillation extracts general knowledge from a pretrained teacher network and provides gui...
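Several of the abstracts above refer to the temperature-scaled soft-label objective and to updating only a small subset of teacher parameters. The following is a minimal, generic PyTorch sketch of that common setup, not the implementation of any of the papers listed; the function names, the choice of LayerNorm parameters for the partial teacher update, and the hyperparameter values (tau, alpha) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=0.5):
    """Generic KD objective: weighted sum of cross-entropy on hard labels and
    temperature-scaled KL divergence against the teacher's soft targets."""
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by tau**2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    # Hard targets: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def partially_unfreeze_teacher(teacher: nn.Module):
    """Hypothetical illustration of a parameter-efficient teacher update:
    freeze all teacher parameters, then re-enable gradients only for
    normalization layers. This is only meant to show the general idea of
    updating a small fraction of teacher parameters during distillation."""
    for p in teacher.parameters():
        p.requires_grad = False
    for m in teacher.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True

# Example usage (models and data are placeholders):
#   partially_unfreeze_teacher(teacher)
#   loss = kd_loss(student(x), teacher(x), y)
#   loss.backward()

The tau**2 factor compensates for the 1/tau scaling of the gradients through the softened logits, so the relative weight of the soft-target term stays roughly constant as the temperature is varied.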