Vision-language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods learn only image-text alignments; some others use pre-trained object detectors to capture vision-language alignments at the object level. In this paper, we propose to learn multi-grained vision-language alignments with a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Building on this framework, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture that further unifies image-text pre-training and video-text pre-training in one model. X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions...
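To make the idea of multi-grained aligning concrete, the sketch below shows one plausible way to apply an InfoNCE-style contrastive loss independently at several visual granularities (object, region, full image), each paired with its own text description, and sum the results. This is only a minimal illustration under that assumption; the function names, the temperature value, and the batch layout are hypothetical and are not taken from the X$^2$-VLM paper, which combines several pre-training objectives beyond what is shown here.

```python
# Minimal sketch of multi-grained vision-text contrastive alignment.
# Assumption: each granularity (object / region / full image) contributes an
# InfoNCE-style loss over matched (visual, text) embedding pairs; this is an
# illustration of the general idea, not the exact X^2-VLM objective.
import torch
import torch.nn.functional as F


def info_nce(vis_emb: torch.Tensor, txt_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched (visual, text) pairs."""
    vis_emb = F.normalize(vis_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = vis_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast in both directions (vision->text and text->vision) against the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def multi_grained_alignment_loss(batch: dict) -> torch.Tensor:
    """Sum contrastive losses over all granularities present in the batch.

    `batch` maps a granularity name ("object", "region", "image") to a pair of
    embedding tensors (visual_features, text_features), each of shape (B, D).
    """
    total = torch.zeros((), dtype=torch.float32)
    for vis, txt in batch.values():
        total = total + info_nce(vis, txt)
    return total


if __name__ == "__main__":
    # Toy example with random embeddings standing in for encoder outputs.
    B, D = 8, 256
    batch = {
        "object": (torch.randn(B, D), torch.randn(B, D)),
        "region": (torch.randn(B, D), torch.randn(B, D)),
        "image":  (torch.randn(B, D), torch.randn(B, D)),
    }
    print(multi_grained_alignment_loss(batch))
```

In this formulation, each granularity shares the same loss form, so adding a new level of alignment (for example, video clips paired with captions) would only require contributing another (visual, text) embedding pair to the batch.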
Vision-language pre-training (VLP) on large-scale datasets has shown premier performance on various ...
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its trans...
The pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language ...
Most existing methods in vision language pre-training rely on object-centric features extracted thro...
In the past few years, the emergence of pre-training models has brought uni-modal fields such as com...
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and ...
Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeli...
As transformer evolves, pre-trained models have advanced at a breakneck pace in recent years. They h...
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correc...
Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models t...
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language represent...
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and l...
Pretrained models have produced great success in both Computer Vision (CV) and Natural Language Proc...
Large-scale pretrained foundation models have been an emerging paradigm for building artificial inte...