We aim to build a multilingual speech synthesis system that generates speech with the proper accent while retaining the characteristics of an individual voice. This is challenging because bilingual training data is expensive to obtain across multiple languages, and without it speaker, language, and accent become strongly correlated and entangled, which leads to poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker, and fine-grained $F_0$ and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any ...
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and pro...
In this paper, we introduce a massively multilingual speech corpus with fine-grained phonemic trans...
Serenity and fluency are the most important synthesis qualities expected from text-to-speech. This p...
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of...
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied t...
Accent plays a significant role in speech communication, influencing understanding capabilities and ...
Most people who have tried to learn a foreign language would have experienced difficulties understan...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of...
In this paper, we study the disentanglement of speaker and language representations in non-autoregre...
In this paper we present a new method to synthesize multiple languages with the voice of any arbitra...
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning fr...
Text-to-speech synthesis (TTS) has progressed to such a stage that given a large, clean, phoneticall...
Text-to-speech synthesis (TTS) turns a written text into an audio speech signal. Many commercial sys...
Modern text-to-speech systems are modular in many different ways. In recent years, end-users gained ...