We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion as data augmentation for automatic speech recognition (ASR) systems. Through extensive experiments, we show that our approach allows speech synthesis and voice conversion to improve ASR systems on a target language using only one target-language speaker during model training. Compared with other works that use many speakers, we managed to close the gap between ASR models trained on synthesized speech and those trained on human speech. Finally, we show that our data augmentation method can obtain promising ASR training results using only a single real speaker in a target language.
Comment: The paper is under consideration at the...
Self-supervised learning (SSL) methods which learn representations of data without explicit supervis...
In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (AS...
Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resu...
We present a method for cross-lingual training of an ASR system using absolutely no transcribed trainin...
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech r...
We work to create a multilingual speech synthesis system which can generate speech with the proper a...
The idea of combining multiple languages’ recordings to train a single automatic speech recognition ...
The recent development of neural network-based automatic speech recognition (ASR) systems has greatl...
Most people who have tried to learn a foreign language would have experienced difficulties understan...
In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which ...
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning fr...
Rapid deployment of automatic speech recognition (ASR) in new languages, with very limited data, is ...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
In this paper, we study the disentanglement of speaker and language representations in non-autoregre...
End-to-end formulation of automatic speech recognition (ASR) and speech translation (ST) makes it ea...