Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing the speech waveform into discrete acoustic tokens and modeling these tokens with a language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt from an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone a personal speaking style. In this paper, we propose a novel zero-shot TTS model with multi-scale acoustic prompts, based on the neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme level from the style promp...
Most people who have tried to learn a foreign language would have experienced difficulties understan...
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and pro...
Voice cloning is a difficult task which requires robust and informative features incorporated in a h...
Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time and...
While most research into speech synthesis has focused on synthesizing high-quality speech for in-dat...
For personalized speech generation, a neural text-to-speech (TTS) model must be successfully impleme...
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voi...
Personalizing a speech synthesis system is a highly desired application, where the system can genera...
Speaker adaptation in text-to-speech synthesis (TTS) is to finetune a pre-trained TTS model to adapt...
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied t...
We work to create a multilingual speech synthesis system which can generate speech with the proper a...
This is preprocessed data and pretrained models from two of our papers: "Zero-Shot Multi-Speaker Te...
Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research top...
In this paper, we propose an end-to-end text-to-speech system deployment wherein a user feeds input ...
The recent advances in text-to-speech have been awe-inspiring, to the point of synthesizing near-hum...