Language and vision provide complementary information. Integrating both modalities in a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple and effective method that learns a language-to-vision mapping and uses its output visual predictions to build multimodal representations. In this sense, our method provides a cognitively plausible way of building representations, consistent with the inherently reconstructive and associative nature of human memory. Using seven benchmark concept similarity tests, we show that the mapped (or imagined) vectors not only help to fuse multimodal information, but also outperform strong...
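The pipeline this abstract describes can be sketched in a few lines: learn a mapping from text embeddings to paired visual features, then "imagine" a visual vector for a word and concatenate it with the text vector. The sketch below is illustrative only; the dimensions, the random stand-in data, and the choice of a ridge-regression linear mapping are assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for real embeddings (assumption: sizes and data are illustrative).
n_words, d_text, d_vis = 100, 50, 20
T = rng.normal(size=(n_words, d_text))  # text embeddings, one row per word
V = rng.normal(size=(n_words, d_vis))   # paired visual feature vectors

# Learn a linear language-to-vision mapping via ridge regression:
#   W = argmin_W ||T W - V||^2 + lam * ||W||^2
lam = 1.0
W = np.linalg.solve(T.T @ T + lam * np.eye(d_text), T.T @ V)

def imagine(text_vec):
    """Predict ('imagine') a visual vector from a word's text embedding."""
    return text_vec @ W

def multimodal(text_vec):
    """Fuse the text vector with its imagined visual vector
    by L2-normalized concatenation."""
    t = text_vec / np.linalg.norm(text_vec)
    v = imagine(text_vec)
    v = v / np.linalg.norm(v)
    return np.concatenate([t, v])

m = multimodal(T[0])
print(m.shape)  # (70,) — d_text + d_vis
```

A learned nonlinear mapping (e.g. a small MLP) could replace the ridge solution without changing the surrounding fusion logic.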
This electronic version was submitted by the student author. The certified thesis is available in th...
Contrastive learning is a form of distance learning that aims to learn invariant features from two r...
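As a minimal illustration of the contrastive idea mentioned above, the sketch below implements an InfoNCE-style loss over two batches of paired views, where matching rows are positives and all other rows serve as negatives. This is a common contrastive objective, assumed here for illustration; the cited work's exact loss, encoders, and augmentations may differ.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss for paired views: row i of z1 and row i of z2 are positives,
    all other rows act as in-batch negatives. tau is the temperature."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                     # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy, positives on the diagonal

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 16))
# Loss is low when views are aligned, high when the pairing is broken.
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
misaligned = info_nce(z, np.roll(z, 1, axis=0))
print(aligned < misaligned)  # True
```

The temperature `tau` controls how sharply the loss concentrates on the hardest negatives; values around 0.05–0.5 are typical in practice.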
Existing vision-language methods typically support two languages at a time at most. In this paper, w...
Integrating visual and linguistic information into a single multimodal representation is an unsolved...
In recent years, joint text-image embeddings have significantly improved thanks to the development o...
Grounding natural language onto real-world perception is a fundamental challenge to empower various ...
We propose Imaginet, a model of learning visually grounded representations of language from coupled ...
Large language models are known to suffer from the hallucination problem in that they are prone to o...
This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models c...