In this paper, we propose a new approach to learning multimodal multilingual embeddings for matching images with their relevant captions in two languages. We combine two existing objective functions so that images and captions lie close together in a joint embedding space, while adapting the alignment of word embeddings across the languages in our model. We show that our approach generalizes better, achieving state-of-the-art performance on text-to-image and image-to-text retrieval tasks as well as a caption-caption similarity task. Two multimodal multilingual datasets are used for evaluation: Multi30k, with German and English captions, and Microsoft COCO, with English and Japanese captions.
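The joint embedding objective described above is commonly instantiated as a bidirectional max-margin ranking loss over matched image-caption pairs. The sketch below is an illustration of that standard formulation, not the paper's exact objective; the function name, the margin value, and the use of in-batch negatives are assumptions for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit L2 norm so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def bidirectional_ranking_loss(img, cap, margin=0.2):
    """Bidirectional max-margin ranking loss for a joint embedding space.

    img, cap: (n, d) arrays; row i of each is a matched image-caption pair.
    All other rows in the batch act as negatives for both retrieval directions.
    """
    img = l2_normalize(img)
    cap = l2_normalize(cap)
    sim = img @ cap.T                # (n, n) cosine similarity matrix
    pos = np.diag(sim)               # similarities of the matched pairs
    # Hinge penalties when a negative scores within `margin` of the positive,
    # in both the image-to-text and text-to-image directions.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])
    n = sim.shape[0]
    mask = 1.0 - np.eye(n)           # exclude the positive pair itself
    return float(((cost_i2t + cost_t2i) * mask).sum() / n)
```

With well-separated embeddings the loss is zero; when every item collapses to the same vector, every in-batch negative violates the margin and the loss is positive. A multilingual extension would sum this loss over image-caption pairs in each language while a separate alignment term ties the two languages' word embeddings together.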