Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications, especially image-text retrieval in the fields of computer vision and natural language processing. Recently, visual and semantic embedding (VSE) learning has shown promising improvements on image-text retrieval tasks. Most existing VSE models employ two unrelated encoders to extract features, then use complex methods to contextualize and aggregate those features into holistic embeddings. Despite recent advances, existing approaches still suffer from two limitations: 1) without considering intermediate interaction and adequate alignment between different modalities, these models cannot guarantee the discriminative ability of representa...
Abstract In numerous multimedia and multi-modal tasks from image and video retrieval to zero-shot r...
Visual Semantic Embedding (VSE) networks aim to extract the semantics of images and their descriptio...
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fr...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Cross-modal retrieval aims to find relevant data of different modalities, such as images and text. I...
In this paper, we propose a new approach to learn multimodal multilingual embeddings for matching im...
Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval ...
Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval ...
In recent years, tremendous success has been achieved in many computer vision tasks using deep learn...
In recent years, tremendous success has been achieved in many computer vision tasks using deep learn...
In numerous multimedia and multi-modal tasks from image and video retrieval to zero-shot recognitio...
We explore methods to enrich the diversity of captions associated with pictures for learning improve...
Visual Semantic Embedding (VSE) networks aim to extract the semantics of images and their descriptio...
Abstract In numerous multimedia and multi-modal tasks from image and video retrieval to zero-shot r...
Visual Semantic Embedding (VSE) networks aim to extract the semantics of images and their descriptio...
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fr...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Cross-modal retrieval aims to find relevant data of different modalities, such as images and text. I...
In this paper, we propose a new approach to learn multimodal multilingual embeddings for matching im...
Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval ...
Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval ...
In recent years, tremendous success has been achieved in many computer vision tasks using deep learn...
In recent years, tremendous success has been achieved in many computer vision tasks using deep learn...
In numerous multimedia and multi-modal tasks from image and video retrieval to zero-shot recognitio...
We explore methods to enrich the diversity of captions associated with pictures for learning improve...
Visual Semantic Embedding (VSE) networks aim to extract the semantics of images and their descriptio...
Abstract In numerous multimedia and multi-modal tasks from image and video retrieval to zero-shot r...
Visual Semantic Embedding (VSE) networks aim to extract the semantics of images and their descriptio...
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fr...