Distributional shift is a central challenge in deploying machine learning models, which can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation, where encoded representations are easily undermined by unseen prompts, degrading the generated audio -- the limited set of text-audio pairs remains inadequate for conditional audio generation in the wild, since user prompts are under-specified. In particular, we observe consistent audio quality degradation in samples generated from user prompts, as opposed to training-set prompts. To this end, we present a retrieval-based in-context prompt editing framework that leverages the training captions as demonstrative exempla...
Speech representations learned from Self-supervised learning (SSL) models can benefit various speech...
Voice dictation is an increasingly important text input modality. Existing systems that allow both d...
In-context learning is a recent paradigm in natural language understanding, where a large pre-traine...
Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio ...
In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing...
We explore the idea of compressing the prompts used to condition language models, and show that comp...
We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative model...
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by ...
Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio c...
Speech conveys more information than just text, as the same word can be uttered in various voices to...
Audiobooks are a powerful source of rich information for speech synthesis. Recent work has been foc...
Using a text description as prompt to guide the generation of text or images (e.g., GPT-3 or DALLE-2...
A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trai...
Automated audio captioning is the multimodal task of describing environmental ...
The advent of hyper-scale and general-purpose pre-trained models is shifting the paradigm of buildin...