In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting the audio and visual features into a common feature space to obtain joint audio-visual embeddings. These embeddings are used to retrieve audio samples that fit a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning.
Peer Reviewed
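As a rough illustration of how Recall@K can be computed for cross-modal retrieval in a joint embedding space, the sketch below ranks audio embeddings against each video embedding and checks whether the paired audio clip lands in the top K. It is not the authors' implementation: the function names (`cosine`, `recall_at_k`), the use of cosine similarity, the convention that paired items share the same list index, and the toy 2-D embeddings are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(query_embs, target_embs, k):
    """Fraction of queries whose paired target (same index, by assumption) ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        sims = [cosine(query, target) for target in target_embs]
        # Indices of targets sorted from most to least similar to the query.
        ranked = sorted(range(len(target_embs)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Hypothetical toy embeddings already projected into a shared 2-D space.
video_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_embs = [[0.9, 0.1], [0.8, 0.9], [0.1, 1.0]]

for k in (1, 2, 3):
    print(f"Recall@{k} (video -> audio): {recall_at_k(video_embs, audio_embs, k):.3f}")
```

Swapping the two arguments gives the audio-to-video direction; with real data the embeddings would come from the learned projections rather than hand-written lists.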
Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates giv...
In recent years, there have been numerous developments toward solving multimodal tasks, aiming to le...
Multimodal information processing has received considerable attention in recent years. The focus of ...
Cross-modal retrieval learns the relationship between the two types of data in a common space so tha...
We present an audio-visual multimodal approach for the task of zero-shot learning (ZSL) for classific...
Humans have an amazing cross-modal learning capability. In order to endow the computers wit...
In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on ...
In the last few years, digital media has spread widely due to efficient media encoding al...
Conference of 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, Iv and L...
In recent years, tremendous success has been achieved in many computer vision tasks using deep learn...