Methods for extracting audio and speech features have been studied since pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition to develop general-purpose audio representations. For example, deep neural networks can extract effective embeddings when they are trained on large audio datasets. This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets. Lastly, we present a novel training framework that produces a hybrid audio representation, combining handcrafted and data-driven learned audio features. All the proposed representations were evaluated within the HEAR NeurIPS...
In this paper, we work on a sound recognition system that continually incorporates new sound classes...
We study the usability of pre-trained weakly supervised audio tagging (AT) models as feature extract...
We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of repres...
Pre-trained models are essential as feature extractors in modern machine learning systems in various...
The success of supervised deep learning methods is largely due to their ability to learn relevant fe...
Deep neural networks trained with supervised learning algorithms on large amounts of labeled speech ...
The goal of universal audio representation learning is to obtain foundational models that can be use...
We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech...
Learning rich visual representations using contrastive self-supervised learning has been extremely s...
State-of-the-art speaker verification systems are inherently dependent on some kind of human supervi...
We present a new Self-Supervised Learning (SSL) approach to pre-train encoders on unlabeled audio da...
Voice cloning is a difficult task which requires robust and informative features incorporated in a h...
Although supervised deep learning has revolutionized speech and audio processing, it has necessitate...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...
Can we leverage the audiovisual information already present in video to improve self-supervised repr...