Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate speech with the voice characteristics of an unseen speaker. The main challenge of ZSM-TTS is to increase speaker similarity for unseen speakers. One of the most successful speaker conditioning methods for flow-based multi-speaker text-to-speech (TTS) models uses functions that predict the scale and bias parameters of the affine coupling layers from the given speaker embedding vector. In this letter, we improve on this speaker conditioning method by introducing a speaker-normalized affine coupling (SNAC) layer, which allows speech synthesis for unseen speakers in a zero-shot manner by leveraging a normalization-based condition...
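The scale-and-bias conditioning mentioned in the abstract above can be illustrated with a minimal sketch of a generic speaker-conditioned affine coupling layer. This is not the SNAC layer proposed in the letter, whose details are not given here. The function names, the toy linear conditioner, the `tanh` bound on the log-scale, and all dimensions are assumptions made purely for illustration.

```python
import math
import random


def make_conditioner(dim_in, dim_spk, dim_out, seed=0):
    """Hypothetical toy conditioner: a single linear layer with fixed random weights.

    It maps the untouched half of the input plus the speaker embedding
    to a log-scale vector and a bias vector.
    """
    rng = random.Random(seed)
    weights = [
        [rng.uniform(-0.5, 0.5) for _ in range(dim_in + dim_spk)]
        for _ in range(2 * dim_out)
    ]

    def conditioner(x_a, spk):
        inp = list(x_a) + list(spk)
        out = [sum(w * v for w, v in zip(row, inp)) for row in weights]
        # Bound the log-scale with tanh for numerical stability (an assumption).
        log_scale = [math.tanh(v) for v in out[:dim_out]]
        bias = out[dim_out:]
        return log_scale, bias

    return conditioner


def coupling_forward(x, spk, conditioner):
    """Affine coupling: keep the first half, transform the second half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, bias = conditioner(x_a, spk)
    y_b = [xb * math.exp(s) + b for xb, s, b in zip(x_b, log_scale, bias)]
    log_det = sum(log_scale)  # log-determinant of the Jacobian
    return x_a + y_b, log_det


def coupling_inverse(y, spk, conditioner):
    """Exact inverse: the first half is unchanged, so scale and bias can be recomputed."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, bias = conditioner(y_a, spk)
    x_b = [(yb - b) * math.exp(-s) for yb, s, b in zip(y_b, log_scale, bias)]
    return y_a + x_b


# Usage: a 4-dimensional feature frame and a 3-dimensional speaker embedding.
cond = make_conditioner(dim_in=2, dim_spk=3, dim_out=2)
frame = [0.3, -1.2, 0.7, 2.0]
speaker = [0.1, -0.4, 0.9]
hidden, log_det = coupling_forward(frame, speaker, cond)
restored = coupling_inverse(hidden, speaker, cond)
```

Because the first half of the input passes through unchanged, the inverse can recompute the same scale and bias from it, so the layer stays invertible regardless of how the speaker embedding enters the conditioner.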
The development of neural vocoders (NVs) has resulted in the high-quality and fast generation of wav...
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus, which is tr...
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied t...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
For personalized speech generation, a neural text-to-speech (TTS) model must be successfully impleme...
Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time and...
While most research into speech synthesis has focused on synthesizing high-quality speech for in-dat...
Personalizing a speech synthesis system is a highly desired application, where the system can genera...
While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditi...
Speaker adaptation in text-to-speech synthesis (TTS) is to finetune a pre-trained TTS model to adapt...
Recent advancements in deep learning led to human-level performance in single-speaker speech synthe...
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and pro...
Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers. However th...
Voice cloning is a difficult task which requires robust and informative features incorporated in a h...
This is preprocessed data and pretrained models from two of our papers: "Zero-Shot Multi-Speaker Te...