The heterogeneity gap problem is the main challenge in cross-modal retrieval. Because cross-modal data (e.g. audiovisual) have different distributions and representations that cannot be directly compared. To bridge the gap between audiovisual modalities, we learn a common subspace for them by utilizing the intrinsic correlation in the natural synchronization of audio-visual data with the aid of annotated labels. TNN-CCCA is the best audio-visual cross-modal retrieval (AV-CMR) model so far, but the model training is sensitive to hard negative samples when learning common subspace by applying triplet loss to predict the relative distance between inputs. In this paper, to reduce the interference of hard negative samples in representation learn...
We propose a novel deep training algorithm for joint representation of audio and visual information ...
© 2017 Association for Computing Machinery. Cross-modal retrieval aims to enable flexible retrieval ...
We propose a novel deep training algorithm for joint representation of audio and visual information ...
Cross-modality retrieval encompasses retrieval tasks where the fetched items are of a different type...
In recent years, there have been numerous developments towards solving multimodal tasks, aiming to l...
Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates giv...
Cross-modal retrieval has been attracting increasing attention because of the explosion of multi-mod...
The cross-modal retrieval task can return different modal nearest neighbors, such as image or text. ...
Deep cross-modal learning has successfully demonstrated excellent performance in cross-modal multime...
We tackle the cross-modal retrieval problem, where learning is only supervised by relevant multi-mod...
Cross-modal representation learning learns a shared embedding between two or more modalities to impr...
Multi-modal Contrastive Representation learning aims to encode different modalities into a semantica...
Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates giv...
We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual e...
This paper considers contrastive training for cross-modal 0-shot transfer wherein a pre-trained mode...
We propose a novel deep training algorithm for joint representation of audio and visual information ...
© 2017 Association for Computing Machinery. Cross-modal retrieval aims to enable flexible retrieval ...
We propose a novel deep training algorithm for joint representation of audio and visual information ...
Cross-modality retrieval encompasses retrieval tasks where the fetched items are of a different type...
In recent years, there have been numerous developments towards solving multimodal tasks, aiming to l...
Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates giv...
Cross-modal retrieval has been attracting increasing attention because of the explosion of multi-mod...
The cross-modal retrieval task can return different modal nearest neighbors, such as image or text. ...
Deep cross-modal learning has successfully demonstrated excellent performance in cross-modal multime...
We tackle the cross-modal retrieval problem, where learning is only supervised by relevant multi-mod...
Cross-modal representation learning learns a shared embedding between two or more modalities to impr...
Multi-modal Contrastive Representation learning aims to encode different modalities into a semantica...
Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates giv...
We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual e...
This paper considers contrastive training for cross-modal 0-shot transfer wherein a pre-trained mode...
We propose a novel deep training algorithm for joint representation of audio and visual information ...
© 2017 Association for Computing Machinery. Cross-modal retrieval aims to enable flexible retrieval ...
We propose a novel deep training algorithm for joint representation of audio and visual information ...