Self-supervised representation learning has recently attracted considerable research interest in both the audio and visual modalities. However, most works typically focus on a particular modality or feature alone, and there has been very limited work studying the interaction between the two modalities for learning self-supervised representations. We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. Through this process, the audio encoder network le...
In many domains, such as artificial intelligence, computer vision, speech, and bioinformatics, featu...
Obtaining large, human-labelled speech datasets to train models for emotion recognition is a notorio...
In this paper, we investigate how to learn rich and robust feature representations for audio classif...
With the advance in self-supervised learning for audio and visual modalities, it has become possible...
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervi...
Imagine the sound of waves. This sound may evoke the memories of days at the beach. A single sound s...
Human perception and learning are inherently multimodal: we interface with the world through multipl...
Automatic speech recognition has seen recent advancements powered by machine learning, but it is sti...
Deep learning has demonstrated impressive results for tasks where the training of neural networks ca...
Although supervised deep learning has revolutionized speech and audio processing, it has necessitate...
(25 pages, 14 figures, https://samsad35.github.io/site-mdvae/) In this paper, we present a multimodal ...
Deep learning has fueled an explosion of applications, yet training deep neural networks usually req...
We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech...
We consider the question: what can be learnt by looking at and listening to a large number of unlabe...