People can easily imagine the potential sound while seeing an event. This natural synchronization between audio and visual signals reveals their intrinsic correlations. We therefore propose to learn audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner; the learned correlations can then be readily applied to multiple downstream tasks, such as audio-visual cross-modal localization and retrieval. To tackle the problem, we introduce a novel Variational AutoEncoder (VAE) framework that consists of Multiple encoders and a Shared decoder (MS-VAE), with an additional Wasserstein distance constraint. Extensive experiments demonstrate that the optimized latent representation of the proposed MS-V...
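As a rough illustration of what a Wasserstein distance constraint between two modality-specific latent distributions might look like, the sketch below computes the closed-form squared 2-Wasserstein distance between two diagonal Gaussians. It is not the authors' implementation: the abstract does not state which form of the constraint is used, so the diagonal-Gaussian posteriors, the function name `w2_squared_diag_gaussians`, and the example values are all assumptions made for illustration.

```python
from typing import Sequence


def w2_squared_diag_gaussians(
    mu_a: Sequence[float],
    sigma_a: Sequence[float],
    mu_b: Sequence[float],
    sigma_b: Sequence[float],
) -> float:
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For N(mu_a, diag(sigma_a^2)) and N(mu_b, diag(sigma_b^2)) the closed form is
    ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2, where sigma are standard deviations.
    Hypothetical helper: the source does not specify this form.
    """
    if not (len(mu_a) == len(sigma_a) == len(mu_b) == len(sigma_b)):
        raise ValueError("all inputs must have the same latent dimension")
    # Distance between the means of the two latent distributions.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    # Distance between the per-dimension standard deviations.
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_a, sigma_b))
    return mean_term + std_term


# Assumed example: latent statistics from a hypothetical audio encoder
# and a hypothetical visual encoder for the same video clip.
audio_mu, audio_sigma = [0.0, 0.0], [1.0, 1.0]
visual_mu, visual_sigma = [3.0, 4.0], [1.0, 1.0]
penalty = w2_squared_diag_gaussians(audio_mu, audio_sigma, visual_mu, visual_sigma)
```

Under these assumptions, a penalty of this kind could be added to the usual VAE objective so that the audio and visual encoders are pushed toward similar latent distributions before the shared decoder is applied.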
Audiovisual (AV) representation learning is an important task from the perspec...
Visual and audio modalities are two symbiotic modalities underlying videos, which contain both commo...
Controllability, despite being a much-desired property of a generative model, remains an ill-defined...
This paper for the first time explores audio-visual event localization in an unsupervised manner. Pr...
We present CrissCross, a self-supervised framework for learning audio-visual representations. A nove...
In many domains, such as artificial intelligence, computer vision, speech, and bioinformatics, featu...
Localizing the audio-visual events in video requires a combined judgment of visual and audio compone...
25 pages, 14 figures, https://samsad35.github.io/site-mdvae/ In this paper, we present a multimodal \...
We propose a novel deep training algorithm for joint representation of audio and visual information ...
In human multi-modality perception systems, the benefits of integrating auditory and visual informat...
Real-world phenomena involve complex interactions between multiple signal moda...
In this paper, we tackle the problem of domain-adaptive representation learning for music processing...
Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn ...
Imagine the sound of waves. This sound may evoke the memories of days at the beach. A single sound s...
In this paper, we propose a novel structure for a multi-modal data association referred to as Associ...