Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance into a sequence of discrete codes and to perform code representation learning, in which we predict the code representations from a masked view of the original speech input. Unlike prior self-distillation approaches, in which the teacher and the student share the same modality, our target model predicts representations from a different modality. CoBERT surpasses the most recent state-of-the-art systems on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task.
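To make the masked-prediction objective concrete, the sketch below shows one way such a cross-modal distillation loss could be set up in PyTorch. It is a minimal illustration under stated assumptions, not the paper's implementation: the module names (CodeTeacher, SpeechStudent), the Transformer sizes, and the MSE regression target are all hypothetical. A frozen teacher contextualizes the discrete code sequence, while the student sees a masked view of the continuous speech features and is trained to match the teacher's representations at the masked positions; the teacher consuming codes rather than speech is what distinguishes this from same-modality self-distillation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CodeTeacher(nn.Module):
        # Hypothetical teacher: embeds discrete code IDs and contextualizes them.
        def __init__(self, codebook_size=500, dim=768, n_layers=4, n_heads=12):
            super().__init__()
            self.embed = nn.Embedding(codebook_size, dim)
            layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        @torch.no_grad()  # the teacher provides targets and is not updated by this loss
        def forward(self, codes):  # codes: (B, T) int64
            return self.encoder(self.embed(codes))  # (B, T, dim)

    class SpeechStudent(nn.Module):
        # Hypothetical student: consumes (masked) continuous speech features.
        def __init__(self, feat_dim=80, dim=768, n_layers=12, n_heads=12):
            super().__init__()
            self.proj = nn.Linear(feat_dim, dim)
            self.mask_embed = nn.Parameter(torch.zeros(dim))
            layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        def forward(self, feats, mask):  # feats: (B, T, feat_dim); mask: (B, T) bool
            x = self.proj(feats)
            # Replace masked frames with a learned mask embedding.
            x = torch.where(mask.unsqueeze(-1), self.mask_embed.expand_as(x), x)
            return self.encoder(x)  # (B, T, dim)

    def cobert_style_loss(student, teacher, feats, codes, mask):
        # Regress the student's outputs at masked positions onto the teacher's
        # code representations (cross-modal targets built from the discrete codes).
        targets = teacher(codes)
        preds = student(feats, mask)
        return F.mse_loss(preds[mask], targets[mask])

A training step would sample a mask, compute the loss, and back-propagate only through the student, e.g. mask = torch.rand(feats.shape[:2]) < 0.5 followed by cobert_style_loss(...).backward(). In this sketch the discrete codes are assumed to come from an offline quantizer, for example by clustering features of a pre-trained speech model; any code extractor fits the same setup.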