A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre, from any residual factors, such as recording conditions and background noise. This paper proposes unsupervised, interpretable and fine-grained noise and prosody modeling. We incorporate adversarial training, representation bottleneck and utterance-to-frame modeling in order to learn frame-level noise representations. To the same end, we perform fine-grained prosody modeling via a Fully Hierarchical Variational AutoEncoder (FVAE) which ad...
This article focuses on developing a system for high-quality synthesized and converted speech by add...
We work to create a multilingual speech synthesis system which can generate speech with the proper a...
EC Seventh Framework Programme (FP7/2007-2013). Speech technology can facilitate human-machine interac...
Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed enviro...
This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic v...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
In recent research, in the domain of speech processing, large End-to-End (E2E) systems for Automatic...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...
Expressive text-to-speech (TTS) can synthesize a new speaking style by imitating prosody and timbre f...
In this paper, we explore an improved framework to train a monaural neural enhancement model for ro...
Present advances in speech processing systems aim at providing sturdy and reliable interface...
Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with...
We present a non-supervised approach to optimize and evaluate the synthesis of non-speech audio effe...
We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture using a split vec...
Statistical parametric speech synthesis (SPSS) has seen improvements over recent years, especially ...