Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text. However, commercializing audio generation is challenging because user-input prompts are often under-specified compared to the text descriptions used to train TTA models. In this work, we treat TTA models as a ``blackbox'' and address the user prompt challenge with two key insights: (1) User prompts are generally under-specified, leading to a large alignment gap between user prompts and training prompts. (2) There is a distribution of audio descriptions for which TTA models generate higher-quality audio, which we refer to as ``audionese''. To this end, we rewrite prompts with instruction-tuned model...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
The advent of hyper-scale and general-purpose pre-trained models is shifting the paradigm of buildin...
We present a non-supervised approach to optimize and evaluate the synthesis of non-speech audio effe...
Distributional shift is a central challenge in the deployment of machine learning models as they can...
With the similarity between music and speech synthesis from symbolic input and the rapid development...
In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing...
A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trai...
Using a text description as prompt to guide the generation of text or images (e.g., GPT-3 or DALLE-2...
We present work in progress on TimbreCLIP, an audio-text cross modal embedding trained on single ins...
High-quality instruction-tuning data is critical to improving LLM capabilities. Existing data collec...
Lyrics transcription of polyphonic music is challenging as the background music affects lyrics intel...
Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio c...
Speech conveys more information than just text, as the same word can be uttered in various voices to...
The quality of end-to-end neural text-to-speech (TTS) systems highly depends on the reliable estimat...
Audio classification plays a crucial role in speech and sound processing tasks with a wide range of ...