The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting the linguistic features, such as prosody and phonemes, that a TTS model needs to synthesize speech. The English TTS front-end typically consists of a text normalization (TN) module, a prosody word prosody phrase (PWPP) module, and a grapheme-to-phoneme (G2P) module. However, current research on the English TTS front-end focuses solely on individual modules, neglecting the interdependence between them and resulting in sub-optimal performance for each module. Therefore, this paper proposes a unified front-end framework that captures the dependencies among the English TTS front-end modules. Extensive experiments have de...
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and pro...
In recent years, the concept of end-to-end text-to-speech synthesis has begun to attract the attenti...
To accomplish punctuation restoration, most existing methods focus on introducing extra information ...
In the realm of expressive Text-to-Speech (TTS), explicit prosodic boundaries significantly advance ...
State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enh...
For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays an important role in p...
This thesis proposes to improve and enrich the expressiveness of English Text-to-Speech (TTS) synthe...
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
Speech-to-speech translation (S2ST) converts input speech to speech in another language. A challenge...
The quality of end-to-end neural text-to-speech (TTS) systems highly depends on the reliable estimat...
The task of text-to-speech (TTS) synthesis usually refers to a single language and to a single speak...
Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve high-quality natura...
In neural text-to-speech (TTS), a two-stage system or a cascade of separately learned models have show...
This paper describes the DeepZen text-to-speech (TTS) system for Blizzard Challenge 2023. The goal o...