While most research into speech synthesis has focused on synthesizing high-quality speech for in-dataset speakers, an equally essential yet unsolved problem is synthesizing speech for unseen, out-of-dataset speakers from limited reference data, i.e., speaker-adaptive speech synthesis. Many studies have proposed zero-shot speaker-adaptive text-to-speech and voice conversion approaches for this task. However, most current approaches suffer from degraded naturalness and speaker similarity when synthesizing speech for unseen speakers (i.e., speakers not in the training dataset), due to the model's poor generalizability to out-of-distribution data. To address this problem, we propose GZS-TV, a generalizable zero-shot sp...
We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre o...
Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers. However th...
Training a text-to-speech (TTS) model requires a large-scale, text-labeled speech corpus, which is tr...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time and...
Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research top...
Personalizing a speech synthesis system is a highly desired application, in which the system can genera...
For personalized speech generation, a neural text-to-speech (TTS) model must be successfully impleme...
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voi...
Disentanglement is the task of learning representations that identify and separate factors that expl...
Speaker adaptation in text-to-speech synthesis (TTS) finetunes a pre-trained TTS model to adapt...
Recent advances in neural text-to-speech research have been dominated by two-stage pipelines utilizi...
Recent advancements in deep learning led to human-level performance in single-speaker speech synthe...
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and pro...
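The factorization described in the snippet above can be made concrete with a small sketch. The following is a hypothetical PyTorch model (not taken from any of the papers excerpted here) in which separate content, speaker, and prosody encoders are fused before a decoder predicts a mel-spectrogram; every class name, module choice, and dimension is an illustrative assumption.

```python
# Minimal sketch of a factorized TTS backbone: content, speaker, and prosody
# are encoded separately and fused before decoding. Purely illustrative.
import torch
import torch.nn as nn

class FactorizedTTS(nn.Module):
    def __init__(self, n_phonemes: int = 100, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.content_encoder = nn.Embedding(n_phonemes, d_model)          # linguistic content
        self.speaker_encoder = nn.GRU(n_mels, d_model, batch_first=True)  # timbre from a reference mel
        self.prosody_encoder = nn.Linear(2, d_model)                      # e.g. per-phoneme pitch and energy
        self.decoder = nn.Linear(d_model, n_mels)                         # fused features -> mel frames

    def forward(self, phonemes, reference_mel, prosody):
        content = self.content_encoder(phonemes)            # (B, T, d)
        _, speaker = self.speaker_encoder(reference_mel)    # (1, B, d): utterance-level speaker code
        pros = self.prosody_encoder(prosody)                # (B, T, d)
        fused = content + speaker.transpose(0, 1) + pros    # broadcast the speaker code over time
        return self.decoder(fused)                          # predicted mel-spectrogram (B, T, n_mels)

# Toy usage: one utterance with 12 phonemes and a 50-frame reference recording.
model = FactorizedTTS()
mel = model(torch.randint(0, 100, (1, 12)),   # phoneme ids
            torch.randn(1, 50, 80),           # reference mel-spectrogram
            torch.randn(1, 12, 2))            # pitch/energy values
print(mel.shape)                              # torch.Size([1, 12, 80])
```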
The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one...