In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space, obtaining joint audio-visual embeddings. These embeddings are used to retrieve audio samples that fit a given silent video well, and to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning.
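The evaluation described above — projecting both modalities into a shared space and scoring cross-modal retrieval with Recall@K — can be sketched as follows. This is a minimal illustration, not the paper's exact protocol: the function name, the cosine-similarity scoring, and the assumption that query i's true match is gallery item i are all illustrative choices.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose ground-truth match (assumed to be the
    gallery item with the same index) appears among the k nearest
    gallery items under cosine similarity."""
    # L2-normalise so that a dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                              # (n_queries, n_gallery)
    # indices of gallery items sorted from most to least similar
    ranked = (-sims).argsort(axis=1)
    # hit if the true index is within the top-k ranked items
    hits = (ranked[:, :k] == np.arange(len(q))[:, None]).any(axis=1)
    return hits.mean()

# toy check: identical audio and visual embeddings give perfect Recall@1
emb = np.eye(4)
print(recall_at_k(emb, emb, k=1))  # 1.0
```

In a cross-modal setting, `query_emb` would hold, e.g., the audio embeddings and `gallery_emb` the visual embeddings of the same videos (or vice versa), so Recall@K measures how often the matching clip is retrieved.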
In this paper our objectives are, first, networks that can embed audio and visual inputs into a comm...
With the ever-increasing consumption of audio-visual media on the internet, video understanding has ...
Cross-modal retrieval learns the relationship between the two types of data in a common space so tha...
We present an audio-visual multimodal approach for the task of zero-shot learning (ZSL) for classific...
In recent years, there have been numerous developments towards solving multimodal tasks, aiming to l...
In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on ...
In recent years, tremendous success has been achieved in many computer vision tasks using deep learn...
With the recent resurgence of neural networks and the proliferation of massive...