We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis. EdiTTS allows for targeted, granular editing of audio, both in terms of content and pitch, without the need for any additional training, task-specific optimization, or architectural modifications to the score-based model backbone. Specifically, we apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model while applying masks and softening kernels to ensure that iterative edits are applied only to the target region. Through listening tests and speech-to-text back transcription, we show that EdiTTS outperforms existing baselines and produces robust...
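The masking idea in the abstract above can be illustrated with a minimal sketch. It is not the EdiTTS implementation: the function names `soft_mask` and `blend`, the box-shaped softening kernel, the `ramp` parameter, and the choice to blend two finished mel-spectrogram arrays (rather than mixing inside the diffusion sampling loop) are all assumptions made for illustration. It only shows how a hard mask over the target frames can be softened so that an edit fades in and out at the region boundaries instead of switching on abruptly.

```python
import numpy as np


def soft_mask(num_frames, start, end, ramp=4):
    """Return a per-frame mask that is ~1 inside [start, end) and ~0 outside.

    The hard 0/1 mask is smoothed with a box kernel of width 2*ramp+1
    (an assumed choice of softening kernel) so the edges become gradual.
    """
    mask = np.zeros(num_frames, dtype=np.float64)
    mask[start:end] = 1.0
    if ramp > 0:
        kernel = np.ones(2 * ramp + 1) / (2 * ramp + 1)
        # mode="same" keeps the output length equal to num_frames
        mask = np.convolve(mask, kernel, mode="same")
    return np.clip(mask, 0.0, 1.0)


def blend(original, edited, mask):
    """Mix two (n_mels, num_frames) arrays frame by frame.

    Frames where mask is 1 come from `edited`, frames where mask is 0
    come from `original`, and boundary frames are a weighted average.
    """
    return mask[None, :] * edited + (1.0 - mask[None, :]) * original


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.normal(size=(80, 40))  # stand-in for a mel-spectrogram
    edited = original + 1.0               # stand-in for an edited version
    mask = soft_mask(num_frames=40, start=10, end=30, ramp=4)
    result = blend(original, edited, mask)
    print(result.shape)
```

Outside the softened region the result equals the original frames exactly, which is the property the abstract describes as ensuring edits apply only to the target region.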
Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to t...
Training speech translation (ST) models requires large and high-quality datasets. MuST-C is one of t...
We present EdiT5 - a novel semi-autoregressive text-editing approach designed to combine the strengt...
Text-editing models have recently become a prominent alternative to seq2seq models for monolingual t...
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by ...
Removing background noise from speech audio has been the subject of considerable research and effort...
Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed enviro...
In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing...
While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditi...
Distributional shift is a central challenge in the deployment of machine learning models as they can...
Voice dictation is an increasingly important text input modality. Existing systems that allow both d...
We propose RemixIT, a simple and novel self-supervised training method for speech enhancement. The p...
We present RemixIT, a simple yet effective self-supervised method for training speech enhancement wi...
Although diffusion models in text-to-speech have become a popular choice due to their strong generat...
Writing is, by nature, a strategic, adaptive, and more importantly, an iterative process. A crucial ...