Distillation is an effective knowledge-transfer technique that uses the predicted distributions of a powerful teacher model as soft targets to train a less-parameterized student model. A pre-trained, high-capacity teacher, however, is not always available. Recently proposed online variants instead use the aggregated intermediate predictions of multiple student models as targets to train each student. Although group-derived targets provide a good recipe for teacher-free distillation, simple aggregation functions quickly homogenize the group members, leading to early saturated solutions. In this work, we propose Online Knowledge Distillation with Diverse Peers (OKDDip), which performs two-level distillation during training with multiple auxiliary...
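To make the online recipe described above concrete, the following is a minimal sketch of teacher-free distillation with a simple aggregation function: each peer is trained with a cross-entropy loss on the ground-truth labels plus a KL term toward the average of the other peers' temperature-softened predictions. This illustrates the generic baseline the abstract critiques, not the OKDDip method itself; the names `peer_logits`, `labels`, `T`, and `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def online_kd_loss(peer_logits, labels, T=3.0, alpha=0.5):
    """Average per-peer loss: cross-entropy on hard labels plus KL divergence
    toward the mean of the other peers' temperature-softened predictions."""
    losses = []
    for i, logits in enumerate(peer_logits):
        # Group-derived soft target: simple average of the *other* peers'
        # softened distributions (the kind of aggregation the abstract notes
        # quickly homogenizes the group).
        others = [F.softmax(l / T, dim=-1)
                  for j, l in enumerate(peer_logits) if j != i]
        soft_target = torch.stack(others).mean(dim=0).detach()

        ce = F.cross_entropy(logits, labels)              # hard-label loss
        kd = F.kl_div(F.log_softmax(logits / T, dim=-1),  # distillation loss
                      soft_target, reduction="batchmean") * (T * T)
        losses.append((1.0 - alpha) * ce + alpha * kd)
    return torch.stack(losses).mean()

# Toy usage: 3 peers, batch of 8, 10 classes (logits stand in for peer outputs).
peers = [torch.randn(8, 10, requires_grad=True) for _ in range(3)]
labels = torch.randint(0, 10, (8,))
loss = online_kd_loss(peers, labels)
loss.backward()
```

Because every peer distills from the same simple average, their predictions tend to converge toward one another, which is the homogenization problem that motivates assigning diverse, peer-specific targets instead.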
In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain compact BERT-style langu...
Knowledge distillation extracts general knowledge from a pretrained teacher network and provides gui...
Knowledge distillation (KD) is a method in which a teacher network guides the learning of a student ...
Traditional knowledge distillation uses a two-stage training strategy to transfer knowledge from a h...
Online Knowledge Distillation (OKD) is designed to alleviate the dilemma that the high-capacity pre-...
In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage re...
Knowledge distillation (KD) has been extensively employed to transfer the knowledge from a large tea...
Knowledge distillation is a simple yet effective technique for deep model compression, which aims to...
Knowledge distillation (KD) has shown very promising capabilities in transferring learning represent...
Deep neural networks have achieved great success in a variety of applications, such as self-drivin...
With the development of deep learning, advanced dialogue generation methods usually require a greate...
Knowledge distillation is considered a training and compression strategy in which two neural netw...
Despite the fact that deep neural networks are powerful models and achieve appealing results on many...
Unlike existing knowledge distillation methods that focus on the baseline settings, where the teacher mod...