Imagine the sound of waves. This sound may evoke memories of days at the beach. A single sound serves as a bridge that connects multiple instances of a visual scene. It can group scenes that 'go together' and set apart the ones that do not. Co-occurring sensory signals can thus be used as a target to learn powerful representations for visual inputs without relying on costly human annotations. In this thesis, I introduce effective self-supervised learning methods that curb the need for human supervision. I discuss several tasks that benefit from audio-visual learning, including representation learning for action and audio recognition, visually-driven sound source localization, and spatial sound generation. I introduce an effective contrasti...
We consider the question: what can be learnt by looking at and listening to a large number of unlabe...
This electronic version was submitted by the student author. The certified thesis is available in th...
In this paper, we investigate how to learn rich and robust feature representations for audio classif...
The sound of crashing waves, the roar of fast-moving cars – sound conveys important information abou...
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervi...
Understanding scenes and events is inherently a multi-modal experience. We perceive the world by bo...
Learning from audio-visual data offers many possibilities to express correspondence between the audi...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Comp...
Visual events are usually accompanied by sounds in our daily lives. However, can machines learn ...
Deep learning has fueled an explosion of applications, yet training deep neural networks usually req...
Self-supervised representation learning has recently attracted a lot of research interest for both t...
We propose a self-supervised learning approach for videos that learns representations of both the RG...
Learning rich visual representations using contrastive self-supervised learning has been extremely s...
In this paper, our objectives are, first, networks that can embed audio and visual inputs into a comm...
We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data...