Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by ...
The advent of deep learning has brought about great progress on many funda- mental computer vision t...
This electronic version was submitted by the student author. The certified thesis is available in th...
This work explores how to use self-supervised learning on videos to learn a class-specific image emb...
Self supervised representation learning has recently attracted a lot of research interest for both t...
Imagine the sound of waves. This sound may evoke the memories of days at the beach. A single sound s...
Human perception and learning are inherently multimodal: we interface with the world through multipl...
We consider the question: what can be learnt by looking at and listening to a large number of unlabe...
In this paper our objectives are, first, networks that can embed audio and visual inputs into a comm...
Deep learning has demonstrated impressive results for tasks where the training of neural networks ca...
Automatic speech recognition has seen recent advancements powered by machine learning, but it is sti...
© 2019 IEEE. Segmenting objects in images and separating sound sources in audio are challenging task...
Deep learning has fueled an explosion of applications, yet training deep neural networks usually req...
—Training a robust tracker of objects (such as vehicles and people) using audio and visual informati...
Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn ...
In this paper, we perform audio-visual sound source separation, i.e. to separate component audios fr...
The advent of deep learning has brought about great progress on many funda- mental computer vision t...
This electronic version was submitted by the student author. The certified thesis is available in th...
This work explores how to use self-supervised learning on videos to learn a class-specific image emb...
Self supervised representation learning has recently attracted a lot of research interest for both t...
Imagine the sound of waves. This sound may evoke the memories of days at the beach. A single sound s...
Human perception and learning are inherently multimodal: we interface with the world through multipl...
We consider the question: what can be learnt by looking at and listening to a large number of unlabe...
In this paper our objectives are, first, networks that can embed audio and visual inputs into a comm...
Deep learning has demonstrated impressive results for tasks where the training of neural networks ca...
Automatic speech recognition has seen recent advancements powered by machine learning, but it is sti...
© 2019 IEEE. Segmenting objects in images and separating sound sources in audio are challenging task...
Deep learning has fueled an explosion of applications, yet training deep neural networks usually req...
—Training a robust tracker of objects (such as vehicles and people) using audio and visual informati...
Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn ...
In this paper, we perform audio-visual sound source separation, i.e. to separate component audios fr...
The advent of deep learning has brought about great progress on many funda- mental computer vision t...
This electronic version was submitted by the student author. The certified thesis is available in th...
This work explores how to use self-supervised learning on videos to learn a class-specific image emb...