Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, \textit{e.g.}, a CNN for images and an RNN/Transformer for texts. Such architectural discrepancy may induce different semantic distribution spaces, limit the interactions between images and texts, and consequently result in inferior image-text alignment. To fill this research gap, and inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed \textbf{Hierarchical Alignment Transformers (HAT)}, which consists of an image Trans...
Recently, the cross-modal pre-training task has been a hotspot because of its wide application in va...
Cross-modal hashing is usually regarded as an effective technique for large-scale textual-visual cro...
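As a rough illustration of why hashing suits large-scale retrieval, the sketch below ranks database items by Hamming distance to a query code. It assumes that binary codes have already been produced by some learned hash function for each modality (not shown here, and not taken from the abstract above); the names `hamming` and `rank_by_hamming` and the 4-bit codes are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted from nearest to farthest in Hamming distance."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )


# Hypothetical 4-bit codes: a text query and three image items.
text_query = 0b1011
image_codes = [0b1010, 0b0100, 0b1011]
ranking = rank_by_hamming(text_query, image_codes)
```

Because the comparison reduces to an XOR and a bit count, it stays cheap even for very large databases, which is the usual motivation for hashing-based retrieval.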
Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs between who...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
This article tackles the task of cross-modal image-text retrieval, which has been an ...
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly,...
Cross-modal retrieval has attracted widespread attention in many cross-media similarity search appli...
Large-scale single-stream pre-training has shown strong performance in image-text retrieval. Regre...
The task of retrieving video content relevant to natural language queries play...
The relations expressed in user queries are vital for cross-modal information retrieval. Relation-fo...
Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolut...
Current cross-modal retrieval systems are evaluated using the R@K measure, which does not leverage semant...
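For readers unfamiliar with the metric, R@K (Recall at K) is commonly computed as the fraction of queries whose ground-truth item appears among the top K retrieved results. The sketch below is a minimal version, assuming exactly one correct item per query and a precomputed similarity matrix; the function name `recall_at_k` and the toy scores are illustrative, not taken from the abstract above.

```python
def recall_at_k(similarity, gt_index, k):
    """Fraction of queries whose ground-truth item is ranked in the top k.

    similarity: list of rows, one per query; higher score means more similar.
    gt_index:   index of the single correct item for each query (assumption).
    """
    hits = 0
    for scores, gt in zip(similarity, gt_index):
        # Rank item indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: 3 text queries scored against 3 images.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
gt = [0, 1, 0]
r_at_1 = recall_at_k(sim, gt, 1)
r_at_2 = recall_at_k(sim, gt, 2)
```

Note that the metric only checks whether the exact ground-truth item is retrieved: a semantically close but non-matching result earns no credit.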
Lately, cross-modal retrieval has attracted considerable attention due to enormous multi-modal data gene...
Cross-modal retrieval aims to find relevant data of different modalities, such as images and text. I...