Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied independently regardless of the TTS model architecture or...
We propose Guided-TTS, a high-quality text-to-speech (TTS) model that does not require any transcrip...
Recent advances in neural text-to-speech (TTS) models bring thousands of TTS applications into daily...
State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enh...
We aim to characterize how different speakers contribute to the perceived output quality of multi-sp...
In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset ...
The quality of end-to-end neural text-to-speech (TTS) systems highly depends on the reliable estimat...
In neural text-to-speech (TTS), two-stage system or a cascade of separately learned models have show...
This letter proposes a perceptual metric for speech quality evaluation, which is suitable, as a loss...
The acoustic environment can degrade speech quality during communication (e.g., video call, remote p...
The mainstream neural text-to-speech(TTS) pipeline is a cascade system, including an acoustic model(...
With the similarity between music and speech synthesis from symbolic input and the rapid development...
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower...
While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditi...
Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed enviro...
We train a MOS prediction model based on wav2vec 2.0 using the open-access data sets BVCC and SOMOS....
We propose Guided-TTS, a high-quality text-to-speech (TTS) model that does not require any transcrip...
Recent advances in neural text-to-speech (TTS) models bring thousands of TTS applications into daily...
State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enh...
We aim to characterize how different speakers contribute to the perceived output quality of multi-sp...
In this work, we present the SOMOS dataset, the first large-scale mean opinion scores (MOS) dataset ...
The quality of end-to-end neural text-to-speech (TTS) systems highly depends on the reliable estimat...
In neural text-to-speech (TTS), two-stage system or a cascade of separately learned models have show...
This letter proposes a perceptual metric for speech quality evaluation, which is suitable, as a loss...
The acoustic environment can degrade speech quality during communication (e.g., video call, remote p...
The mainstream neural text-to-speech(TTS) pipeline is a cascade system, including an acoustic model(...
With the similarity between music and speech synthesis from symbolic input and the rapid development...
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower...
While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditi...
Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed enviro...
We train a MOS prediction model based on wav2vec 2.0 using the open-access data sets BVCC and SOMOS....
We propose Guided-TTS, a high-quality text-to-speech (TTS) model that does not require any transcrip...
Recent advances in neural text-to-speech (TTS) models bring thousands of TTS applications into daily...
State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enh...