Using a text description as a prompt to guide the generation of text or images (e.g., GPT-3 or DALL-E 2) has drawn wide attention recently. Beyond text and image generation, in this work we explore the possibility of using text descriptions to guide speech synthesis. To this end, we develop a text-to-speech (TTS) system, dubbed PromptTTS, that takes a prompt with both style and content descriptions as input and synthesizes the corresponding speech. Specifically, PromptTTS consists of a style encoder and a content encoder that extract the corresponding representations from the prompt, and a speech decoder that synthesizes speech from the extracted style and content representations. Compared with previous works in controllable TTS that require...
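The data flow described in the abstract (style encoder and content encoder feeding a speech decoder) can be illustrated with a toy sketch. This is not the PromptTTS implementation: the keyword vocabulary, the function names, and the "frame" output format are all hypothetical stand-ins chosen only to show how the two representations are extracted separately and then combined.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical style vocabulary; a real system would learn a style representation.
STYLE_KEYWORDS = ("low", "high", "fast", "slow", "loud", "quiet")


@dataclass(frozen=True)
class Prompt:
    style: str    # free-text style description
    content: str  # text to be spoken


def encode_style(description: str) -> List[float]:
    """Toy style encoder: one indicator feature per known keyword."""
    words = set(description.lower().split())
    return [1.0 if kw in words else 0.0 for kw in STYLE_KEYWORDS]


def encode_content(text: str) -> List[str]:
    """Toy content encoder: lowercase whitespace tokens."""
    return text.lower().split()


def decode_speech(
    style_vec: List[float], content_tokens: List[str]
) -> List[Tuple[str, Tuple[float, ...]]]:
    """Toy decoder: emit one placeholder 'frame' per token, conditioned on style."""
    style = tuple(style_vec)
    return [(token, style) for token in content_tokens]


def synthesize(prompt: Prompt) -> List[Tuple[str, Tuple[float, ...]]]:
    """Run both encoders on the prompt, then decode from their outputs."""
    return decode_speech(encode_style(prompt.style), encode_content(prompt.content))


frames = synthesize(Prompt(style="a slow and quiet voice", content="Hello world"))
```

Here every output frame carries the same style vector, which mirrors the idea that style conditions the whole utterance while content determines what is spoken.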
The spontaneous behavior that often occurs in conversations makes speech more human-like compared to...
Text-to-speech synthesis (TTS) has progressed to such a stage that given a large, clean, phoneticall...
Distributional shift is a central challenge in the deployment of machine learning models as they can...
Speech conveys more information than just text, as the same word can be uttered in various voices to...
Expressive text-to-speech (TTS) aims to synthesize speeches with human-like tones, moods, or even ar...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of...
This paper introduces a novel voice conversion (VC) model, guided by text instructions such as "arti...
Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to t...
Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio ...
Getting a text to speech synthesis (TTS) system to speak lively animated stories like a human is ver...
By definition, spontaneous speech is unscripted and created on the fly by the speaker. It is dramati...
Generating expressive, naturally sounding, speech from text using a speech synthesis (TTS) system is...
Text-to-speech synthesis is a key component of interactive, speech-based systems. Typically, buildi...
One of the biggest challenges in speech synthesis is the production of contextually-appropriate natu...