© 2019 IEEE. Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds [1]. Here, we introduce a learning approach to disentangle concepts in the neural networks, and assign semantic categories to network feature channels to enable independent image segmentation and sound source separation after audio-visual training on videos. Our evaluations show that the disentangled model outperforms...
We propose a novel method to automatically detect and extract the video modality of the sound source...
Visual objects often have acoustic signatures that are naturally synchronized with them in audio-bea...
We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data...
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervi...
The advent of deep learning has brought about great progress on many fundamental computer vision t...
This electronic version was submitted by the student author. The certified thesis is available in th...
In this paper, we explore neural network models that learn to associate segments of spoken audio cap...
This paper proposes a new framework for semantic segmentation of objects in videos. We address the l...
Source separation involving mono-channel audio is a challenging problem, in particular for speech se...
Deep learning has fueled an explosion of applications, yet training deep neural networks usually req...
Audio-visual separation aims to isolate pure audio sources from mixture with the guidance of its syn...
In this paper our objectives are, first, networks that can embed audio and visual inputs into a comm...
We propose to self-supervise a convolutional neural network operating on images using temporal infor...
We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to ...
Imagine the sound of waves. This sound may evoke the memories of days at the beach. A single sound s...