In this paper, we study the disentanglement of speaker and language representations in non-autoregressive cross-lingual TTS models from various aspects. We propose a phoneme length regulator that solves the length mismatch problem between the IPA input sequence and monolingual alignment results. Using the phoneme length regulator, we present a FastPitch-based cross-lingual model with IPA symbols as input representations. Our experiments show that language-independent input representations (e.g., IPA symbols), an increasing number of training speakers, and explicit modeling of speech variance information all encourage non-autoregressive cross-lingual TTS models to disentangle speaker and language representations. The subjective evaluation shows th...
Data augmentation via voice conversion (VC) has been successfully applied to low-resource expressive...
Recent breakthroughs in automatic speech recognition (ASR) have resulted in a word error rate (WER) ...
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech r...
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied t...
In this paper, we introduce a massively multilingual speech corpus with fine-grained phonemic trans...
We present a method for cross-lingual training of an ASR system using absolutely no transcribed trainin...
We work to create a multilingual speech synthesis system which can generate speech with the proper a...
This paper contains a post-challenge performance analysis on cross-lingual speaker verification of t...
It is challenging to build a multi-singer high-fidelity singing voice synthesis system with cross-li...
Recently, sequence-to-sequence (seq-to-seq) models have been successfully applied in text-to-speech ...
The recent development of neural network-based automatic speech recognition (ASR) systems has greatl...
The idea of combining multiple languages’ recordings to train a single automatic speech recognition ...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
Pre-trained multilingual language models show significant performance gains for zero-shot cross-ling...
Speech and text are two major forms of human language. The research community has been focusing on m...