Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, much as human perception relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than audio-visual correspondence (AVC). In addition to correspondence, AVSA also learns from the spatial location of acoustic and visual content. Using 360° video and Ambisonics audio, we propose selecting visual objects with object detection and beamforming the audio signal towards the detected objects, in an attempt to learn the spatial alignment between objects and the ...
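As a rough illustration of beamforming Ambisonics audio towards a detected object, the sketch below steers a virtual cardioid microphone in first-order Ambisonics. It is not taken from the paper: the channel convention (ACN order W, Y, Z, X with SN3D normalization), the cardioid pattern, the equirectangular pixel-to-angle mapping, and the function names `pixel_to_direction`, `encode_foa`, and `beamform_cardioid` are all assumptions made for this example.

```python
import numpy as np


def pixel_to_direction(x, y, width, height):
    """Map an equirectangular pixel to (azimuth, elevation) in radians.

    Assumed convention (not from the paper): the image centre is azimuth 0,
    azimuth grows to the left, and elevation grows upward.
    """
    azimuth = (0.5 - x / width) * 2.0 * np.pi
    elevation = (0.5 - y / height) * np.pi
    return azimuth, elevation


def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as a first-order Ambisonics plane wave.

    Assumed channel layout: ACN order (W, Y, Z, X), SN3D normalization.
    """
    w = signal
    y = signal * np.sin(azimuth) * np.cos(elevation)
    z = signal * np.sin(elevation)
    x = signal * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])


def beamform_cardioid(foa, azimuth, elevation):
    """Steer a virtual cardioid microphone towards (azimuth, elevation)."""
    w, y, z, x = foa
    # Project the first-order components onto the look direction.
    steered = (
        x * np.cos(azimuth) * np.cos(elevation)
        + y * np.sin(azimuth) * np.cos(elevation)
        + z * np.sin(elevation)
    )
    # Equal mix of omnidirectional and dipole parts gives a cardioid.
    return 0.5 * (w + steered)


# Hypothetical usage: the centre of a detected bounding box in a 1920x1080 frame.
t = np.linspace(0.0, 1.0, 1000)
source = np.sin(2.0 * np.pi * 5.0 * t)
az, el = pixel_to_direction(480, 270, 1920, 1080)
foa = encode_foa(source, az, el)
towards_object = beamform_cardioid(foa, az, el)
away_from_object = beamform_cardioid(foa, az + np.pi, -el)
```

In this idealized setting, a beam steered at the object returns the source unchanged, while a beam steered in the opposite direction cancels it. Real recordings contain reverberation and several simultaneous sources, so the separation is only partial.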
There is a natural correlation between the visual and auditive elements of a video. In this work, we...
We present CrissCross, a self-supervised framework for learning audio-visual representations. A nove...
The domain of spatial audio comprises methods for capturing, processing, and reproducing audio conte...
Imagine the sound of waves. This sound may evoke the memories of days at the beach. A single sound s...
We propose a novel method for mapping sound spectrograms onto images and thus enabling alignment bet...
In many domains, such as artificial intelligence, computer vision, speech, and bioinformatics, featu...
Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn ...
Deep learning has fueled an explosion of applications, yet training deep neural networks usually req...
Learning rich visual representations using contrastive self-supervised learning has been extremely s...
Self-supervised representation learning has recently attracted a lot of research interest for both t...
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervi...
Humans can robustly recognize and localize objects by integrating visual and auditory cues. While ma...
In this paper, we perform audio-visual sound source separation, i.e. to separate component audios fr...
Previous works on scene classification are mainly based on audio or visual signals, while humans per...