Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech recognition and translation tasks. Due to the high cost of developing these large models, building new encoders for new tasks and deploying them to on-device applications are infeasible. Prior studies propose model compression methods to address this issue, but those works focus on smaller models and less realistic tasks. Thus, we propose Contrastive Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to compress pre-trained speech encoders by leveraging masked prediction and contrastive learning to train student models to copy the behavior of a large teacher model. CoLLD outperforms prior methods and closes the gap be...
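To make the CoLLD idea concrete, below is a minimal sketch of a contrastive layer-to-layer distillation loss, assuming generic teacher and student encoders that expose per-layer hidden states of shape (batch, time, dim). The function names, the one-to-one layer mapping, and the temperature value are illustrative assumptions, not the authors' released implementation; the masked-prediction detail (the student receiving a masked input) is noted but omitted for brevity.

```python
import torch
import torch.nn.functional as F


def layer_contrastive_loss(student_h: torch.Tensor,
                           teacher_h: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over frames for one aligned (student, teacher) layer pair.

    student_h, teacher_h: (batch, time, dim) hidden states.
    Each student frame is pulled toward the teacher frame at the same time
    step; the remaining teacher frames in the utterance act as negatives.
    """
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    # (batch, time, time) similarity matrix between student and teacher frames
    logits = torch.einsum("btd,bsd->bts", s, t) / temperature
    targets = torch.arange(logits.size(1), device=logits.device)
    targets = targets.unsqueeze(0).expand(logits.size(0), -1)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))


def distillation_loss(student_layers, teacher_layers) -> torch.Tensor:
    """Average the contrastive loss over aligned layer pairs.

    In the abstract's setting the student additionally consumes a masked
    input (masked prediction); restricting the loss to masked frames is
    left out of this sketch.
    """
    losses = [layer_contrastive_loss(s, t)
              for s, t in zip(student_layers, teacher_layers)]
    return torch.stack(losses).mean()
```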
The primary goal of FBK's systems submission to the IWSLT 2022 offline and simultaneous speech ...
The SOTA in transcription of disfluent and conversational speech has in recent years favored two-sta...
There is growing interest in unifying the streaming and full-context automatic speech recognition (A...
Recent years have witnessed great strides in self-supervised learning (SSL) on the speech processing...
Self-supervised learning (SSL) achieves great success in speech recognition, while limited explorati...
Large-scale speech self-supervised learning (SSL) has emerged as a central field of speech processing...
Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme re...
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech r...
End-to-end formulation of automatic speech recognition (ASR) and speech translation (ST) makes it ea...
Large language models have become a vital component in modern NLP, achieving state of the art perfor...
Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different doma...
Keyword spotting (KWS) refers to the task of identifying a set of predefined words in audio streams....
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...
While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditi...
We present a new Self-Supervised Learning (SSL) approach to pre-train encoders on unlabeled audio da...