Language and vision provide complementary information. Integrating both modalities in a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple and effective method that learns a language-to-vision mapping and uses its output visual predictions to build multimodal representations. In this sense, our method provides a cognitively plausible way of building representations, consistent with the inherently re-constructive and associative nature of human memory. Using seven benchmark concept similarity tests we show that the mapped (or \emph{imagined}) vectors not only help to fuse multimodal information, but also outperform...
Contrastive learning is a form of distance learning that aims to learn invariant features from two r...
Texts and images provide alternative, yet orthogonal views of the same underlying cognitive concept....
Human brains integrate linguistic and perceptual information simultaneously to understand natural la...
Language and vision provide complementary information. Integrating both modalities in a single multi...
Language and vision provide complementary information. Integrating both modalities in a single multi...
Integrating visual and linguistic information into a single multimodal representation is an unsolved...
abstract: Multimodal Representation Learning is a multi-disciplinary research field which aims to in...
Existing vision-language methods typically support two languages at a time at most. In this paper, w...
Existing vision-language methods typically support two languages at a time at most. In this paper, w...
Recent work has revealed the potential of using visual representations for bilingual lexicon learnin...
In recent years, joint text-image embeddings have significantly improved thanks to the development o...
International audienceIn recent years, joint text-image embeddings have significantly improved thank...
Most human language understanding is grounded in perception. There is thus growing interest in combi...
This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models c...
This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models c...
Contrastive learning is a form of distance learning that aims to learn invariant features from two r...
Texts and images provide alternative, yet orthogonal views of the same underlying cognitive concept....
Human brains integrate linguistic and perceptual information simultaneously to understand natural la...
Language and vision provide complementary information. Integrating both modalities in a single multi...
Language and vision provide complementary information. Integrating both modalities in a single multi...
Integrating visual and linguistic information into a single multimodal representation is an unsolved...
abstract: Multimodal Representation Learning is a multi-disciplinary research field which aims to in...
Existing vision-language methods typically support two languages at a time at most. In this paper, w...
Existing vision-language methods typically support two languages at a time at most. In this paper, w...
Recent work has revealed the potential of using visual representations for bilingual lexicon learnin...
In recent years, joint text-image embeddings have significantly improved thanks to the development o...
International audienceIn recent years, joint text-image embeddings have significantly improved thank...
Most human language understanding is grounded in perception. There is thus growing interest in combi...
This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models c...
This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models c...
Contrastive learning is a form of distance learning that aims to learn invariant features from two r...
Texts and images provide alternative, yet orthogonal views of the same underlying cognitive concept....
Human brains integrate linguistic and perceptual information simultaneously to understand natural la...