In expressive Text-to-Speech (TTS), explicit prosodic boundaries significantly improve the naturalness and controllability of synthesized speech. Although human prosody annotation contributes substantially to performance, it is labor-intensive and time-consuming, and it often yields inconsistent results. Even with extensive supervised data available, the current benchmark model still falls short in performance. To address this issue, this paper proposes a novel two-stage automatic annotation pipeline. Specifically, in the first stage, we propose contrastive text-speech pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs. The pretraining procedure aims at enhancing the prosodic space extracted f...
Prosody refers to rhythm, intonation, and lexical stress in speech, and is expressed via...
Speech utterances are more than the linear concatenation of individual phonemes or words. They are o...
Since the prosody of a spoken utterance carries information about its discourse function, salience, ...
This thesis proposes to improve and enrich the expressiveness of English Text-to-Speech (TTS) synthe...
Prosody and prosodic modeling in trainable Speech Synthesis systems are often based on large corp...
State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enh...
For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays an important role in p...
The quality of end-to-end neural text-to-speech (TTS) systems highly depends on the reliable estimat...
A new method for predicting prosodic parameters, i.e. phone durations and F0 targets, from preproce...
Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research top...
The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extra...
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and pro...
We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal...
This paper introduces the Common Prosody Platform (CPP), a computational platform that implements ma...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...