While generative adversarial network (GAN)-based neural text-to-speech (TTS) systems have shown significant improvements in neural speech synthesis, no TTS system has learned to synthesize speech from text sequences with only adversarial feedback. Because adversarial feedback alone is not sufficient to train the generator, current models still require a reconstruction loss computed directly between the ground-truth and generated mel-spectrograms. In this paper, we present Multi-SpectroGAN (MSG), which can train a multi-speaker model with only adversarial feedback by conditioning a self-supervised hidden representation of the generator on a conditional discriminator. This leads to better guidance for generator training. Moreov...
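The "adversarial feedback" the abstract refers to can be illustrated with a common hinge-loss GAN formulation. This is a minimal sketch, assuming hinge losses over discriminator scores; the abstract does not specify MSG's exact loss functions, and the score arrays here are placeholders for real discriminator outputs:

```python
import numpy as np

def discriminator_hinge_loss(d_real, d_fake):
    # Push scores for real mel-spectrograms above +1 and
    # scores for generated mel-spectrograms below -1.
    real_term = np.mean(np.maximum(0.0, 1.0 - d_real))
    fake_term = np.mean(np.maximum(0.0, 1.0 + d_fake))
    return real_term + fake_term

def generator_hinge_loss(d_fake):
    # The generator is trained purely from the discriminator's
    # feedback: maximize the score assigned to generated mels
    # (no reconstruction loss against the ground-truth spectrogram).
    return -np.mean(d_fake)

# Toy discriminator scores (hypothetical values, not model outputs).
d_real = np.array([1.5, 0.8])   # scores on real spectrograms
d_fake = np.array([-1.2, 0.3])  # scores on generated spectrograms
d_loss = discriminator_hinge_loss(d_real, d_fake)
g_loss = generator_hinge_loss(d_fake)
```

A reconstruction-based system would add a term like `np.mean(np.abs(mel_real - mel_fake))` to the generator loss; the abstract's claim is that MSG trains without that term.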
This paper proposes a method for generating speech from filterbank mel frequency cepstral coefficien...
Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal voc...
This paper adapts a StyleGAN model for speech generation with minimal or no conditioning on text. St...
Recent approaches in text-to-speech (TTS) synthesis employ neural network strategies to vocode perce...
The state-of-the-art in text-to-speech (TTS) synthesis has recently improved considerably due to nov...
Recent advances in neural network-based text-to-speech have reached human level naturalness in synt...
While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesi...
Recent development of neural vocoders based on the generative adversarial neural network (GAN) has s...
The use of the mel spectrogram as a signal parameterization for voice generation is quite recent and...
Generating speech in different styles from any given style is a challenging research problem in spee...
The goal of voice conversion (VC) is to convert speech from a source speaker to that of a target, wi...
The paper presents a novel architecture and method for training neural networks to produce synthesiz...
An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio...
Single-image generative adversarial networks learn from the internal distribution of a single traini...
During the 2000s, unit-selection based text-to-speech was the dominant commercial technology....