The mainstream neural text-to-speech (TTS) pipeline is a cascade system comprising an acoustic model (AM) that predicts acoustic features from the input transcript and a vocoder that generates the waveform from those features. However, the acoustic feature in current TTS systems is typically the mel-spectrogram, which is highly correlated along both the time and frequency axes in a complicated way, making it difficult for the AM to predict. Although recent neural vocoders can generate high-fidelity audio from ground-truth (GT) mel-spectrograms, the gap between the GT mel-spectrogram and the one predicted by the AM degrades the performance of the entire TTS system. In this work, we propose VQTTS, consisting of an AM txt2vec and...
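The cascade described above can be illustrated with a minimal sketch. This is not the VQTTS method or any real model: `toy_acoustic_model`, `toy_vocoder`, `synthesize`, and the constants are hypothetical stand-ins chosen only to show how the acoustic feature acts as the interface between the two stages.

```python
import math

# All names and sizes below are illustrative assumptions, not from the paper.
N_FEATS = 4           # toy feature dimension (real systems use many more bins)
FRAMES_PER_TOKEN = 3  # toy fixed duration per input token
HOP = 8               # toy number of waveform samples per feature frame


def toy_acoustic_model(tokens):
    """Hypothetical AM stand-in: text tokens -> list of feature frames."""
    frames = []
    for tok in tokens:
        base = (ord(tok) % 16) / 16.0  # deterministic value in [0, 1)
        frame = [base * (k + 1) / N_FEATS for k in range(N_FEATS)]
        frames.extend([frame] * FRAMES_PER_TOKEN)
    return frames


def toy_vocoder(frames):
    """Hypothetical vocoder stand-in: feature frames -> waveform samples."""
    wav = []
    for i, frame in enumerate(frames):
        amp = sum(frame) / len(frame)  # per-frame amplitude in [0, 1)
        for j in range(HOP):
            n = i * HOP + j
            wav.append(amp * math.sin(2 * math.pi * n / 16))
    return wav


def synthesize(text):
    """Cascade: the vocoder only ever sees what the AM predicted."""
    return toy_vocoder(toy_acoustic_model(list(text)))
```

Because the second stage consumes whatever the first stage outputs, any error in the predicted features is passed straight into waveform generation, which is the mismatch the abstract points to.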
The development of neural vocoders (NVs) has resulted in the high-quality and fast generation of wav...
Recent advances in neural network-based text-to-speech have reached human-level naturalness in synt...
During the 2000s, unit-selection-based text-to-speech was the dominant commercial technology....
We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture using a split vec...
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower...
Recent advances in neural text-to-speech research have been dominated by two-stage pipelines utilizi...
While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditi...
Recent development of neural vocoders based on the generative adversarial neural network (GAN) has s...
With the similarity between music and speech synthesis from symbolic input and the rapid development...
In neural text-to-speech (TTS), two-stage systems, or cascades of separately learned models, have show...
Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, the...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging ...
The quality of end-to-end neural text-to-speech (TTS) systems depends heavily on the reliable estimat...
Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-lev...