Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this work, we explore the hypothesis that an image and its caption can be regarded as two different views of the underlying mutual information, and we train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a one-tower model for vision-language pretraining (VLP), and propose One Representation (OneR) as a simple yet effective framework for our goal. We discover intriguing properties that distinguish OneR from previous works that have modality-specific representation spaces, such as zero-shot local...
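For readers unfamiliar with the objective this abstract refers to, contrastive image-text training is commonly instantiated as a symmetric InfoNCE loss over paired embeddings. The PyTorch sketch below is a minimal illustration of that generic setup, not the OneR paper's exact formulation; the function name, the temperature value, and the use of in-batch negatives are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(image_emb: torch.Tensor,
                       text_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors where row i of each tensor
    forms a matched pair. Matched pairs are pulled together; every other
    in-batch pairing serves as a negative. (Illustrative sketch only.)
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the positives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: image -> text and text -> image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In a one-tower setting such as the one this abstract describes, both embeddings would come from the same shared encoder rather than from separate image and text towers; the loss itself is unchanged.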
Vision models trained on multimodal datasets can benefit from the wide availab...
Large-scale pretrained foundation models have been an emerging paradigm for building artificial inte...
Recently, the cross-modal pre-training task has been a hotspot because of its wide application in va...
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language represent...
In recent years, joint text-image embeddings have significantly improved thanks to the development o...
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and l...
Language and vision provide complementary information. Integrating both modalities in a single multi...
Recent Vision-Language Pre-trained (VLP) models based on dual encoder have attracted extensive atten...
Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed...