Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR) systems accessible to every language. However, existing methods still heavily rely on hand-crafted pre-processing. Similar to the trend of making supervised speech recognition end-to-end, we introduce wav2vec-U 2.0, which does away with all audio-side pre-processing and improves accuracy through a better architecture. In addition, we introduce an auxiliary self-supervised objective that ties model predictions back to the input. Experiments show that wav2vec-U 2.0 improves unsupervised recognition results across different languages while being conceptually simpler.
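The abstract does not spell out how the auxiliary objective is wired in, so the following is only a minimal sketch under stated assumptions: the generator is a small convolutional network over frozen wav2vec 2.0 features, the auxiliary target is a discrete cluster ID (for example from k-means over those same features), and names such as Wav2vecU2Generator, feat_dim, n_phones and n_clusters are hypothetical placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Wav2vecU2Generator(nn.Module):
    """Hypothetical sketch of a generator for unsupervised ASR.

    Maps frozen wav2vec 2.0 features directly to phoneme logits, with an
    auxiliary head that predicts discrete cluster IDs of the input frames,
    so no hand-crafted audio-side pre-processing is assumed.
    """

    def __init__(self, feat_dim=1024, n_phones=40, n_clusters=64, hidden=512):
        super().__init__()
        # Single strided conv stack over raw SSL features (no MFCC/PCA/segmentation).
        self.backbone = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
        )
        self.phone_head = nn.Conv1d(hidden, n_phones, kernel_size=1)   # adversarial branch
        self.aux_head = nn.Conv1d(hidden, n_clusters, kernel_size=1)   # ties output to input

    def forward(self, feats):
        # feats: (batch, time, feat_dim) frozen wav2vec 2.0 representations
        h = self.backbone(feats.transpose(1, 2))
        phone_logits = self.phone_head(h).transpose(1, 2)
        aux_logits = self.aux_head(h).transpose(1, 2)
        return phone_logits, aux_logits


def auxiliary_loss(aux_logits, cluster_ids):
    # cluster_ids: (batch, time') pseudo-labels derived from the input itself,
    # e.g. k-means over the same frozen features, downsampled to the generator stride.
    return F.cross_entropy(
        aux_logits.reshape(-1, aux_logits.size(-1)),
        cluster_ids.reshape(-1),
    )
```

The point of the sketch is the second head: because its targets are derived from the input audio itself, this loss is available without any transcriptions and anchors the generator's phoneme predictions to the input, alongside whatever adversarial or distribution-matching objective drives the phoneme branch.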
Self-supervised speech models have grown fast during the past few years and have proven feasible for...
In recent years, speech-based self-supervised learning (SSL) has made significant progress in variou...
Advances in self-supervised learning have significantly reduced the amount of transcribed audio requ...
Self-supervised pre-training could effectively improve the performance of low-resource automatic spe...
Spoken language understanding (SLU) is a task aiming to extract high-level semantics from spoken utt...
Self-supervised learning (SSL) achieves great success in speech recognition, while limited explorati...
Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations i...
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is p...
Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to id...
This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-d...
Recent advances in neural text-to-speech research have been dominated by two-stage pipelines utilizi...
Automatic speech recognition (ASR) systems typically use handcrafted feature extraction pipelines. T...
This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recogni...
Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to...
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech r...