Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims to learn cross-modal alignment with little or no paired data, such as aligned images and captions. Recent W-VLP methods, which pair visual features with object tags, achieve performance comparable to that of some VLP models trained with aligned pairs on various V-L downstream tasks. This, however, is not the case in cross-modal retrieval (XMR). We argue that the learning of such a W-VLP model is curbed and biased by object tags of limited semantics. We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak supervision as a W-VLP model, not requiring images paired with captio...
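Since the abstract is truncated, the following is only a minimal sketch of what a visual-vocabulary feature hallucinator of this kind might look like: a module that soft-assigns word or object-tag embeddings to a learnable codebook of visual prototypes in order to produce pseudo region features when no paired image is available. All class, parameter, and dimension names (`VisualVocabularyHallucinator`, `vocab_size`, `feat_dim`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a visual-vocabulary feature hallucinator (names and shapes
# are assumptions; the truncated abstract does not specify the architecture).
import torch
import torch.nn as nn


class VisualVocabularyHallucinator(nn.Module):
    """Maps word/tag embeddings to hallucinated visual features by attending
    over a learnable visual vocabulary (codebook of prototype region features)."""

    def __init__(self, text_dim: int = 768, feat_dim: int = 2048, vocab_size: int = 1024):
        super().__init__()
        # Learnable visual vocabulary: one prototype feature per entry.
        self.visual_vocab = nn.Parameter(torch.randn(vocab_size, feat_dim))
        # Projects text embeddings into the space used to score vocabulary entries.
        self.query_proj = nn.Linear(text_dim, feat_dim)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, text_dim) word or object-tag embeddings.
        queries = self.query_proj(token_emb)          # (B, L, feat_dim)
        scores = queries @ self.visual_vocab.t()      # (B, L, vocab_size)
        weights = scores.softmax(dim=-1)              # soft assignment over the vocabulary
        hallucinated = weights @ self.visual_vocab    # (B, L, feat_dim)
        return hallucinated                           # pseudo "region features"


# Usage: hallucinated features can stand in for detector region features
# when a caption has no paired image.
hallucinator = VisualVocabularyHallucinator()
tags = torch.randn(4, 12, 768)        # e.g. embeddings of 12 object tags
pseudo_visual = hallucinator(tags)    # (4, 12, 2048)
```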
Vision language pre-training aims to learn alignments between vision and language from a large amoun...
Vision-Language Pretraining (VLP) models have recently successfully facilitated many cross-modal dow...
The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide ra...
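As context for the CLIP references here and below, this is a minimal sketch of the symmetric image-text contrastive objective that CLIP-style pre-training optimizes; the function name, batch handling, and temperature value are illustrative assumptions, not the original implementation.

```python
# Sketch of the symmetric contrastive (InfoNCE-style) loss used by CLIP-like models.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize both modalities so similarity is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; matched pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature          # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image retrieval.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


# Usage with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```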
In the past few years, the emergence of pre-training models has brought uni-modal fields such as com...
Vision-language pre-training (VLP) methods have been blossoming recently, and their crucial goal is to joint...
Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeli...
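For readers unfamiliar with the objective named above, the sketch below illustrates cross-modal masked language modeling in its generic form: mask a fraction of text tokens and predict them from the joint image-text sequence. The module, hyperparameters, and feature dimensions are illustrative assumptions rather than any specific paper's architecture.

```python
# Hedged sketch of cross-modal masked language modeling (MLM); all names,
# sizes, and the masking scheme are assumptions for illustration only.
import torch
import torch.nn as nn


class CrossModalMLM(nn.Module):
    def __init__(self, vocab_size: int = 30522, dim: int = 256, mask_id: int = 103):
        super().__init__()
        self.mask_id = mask_id
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(2048, dim)  # project region features to model dim
        encoder_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids, region_feats, mask_prob: float = 0.15):
        # Randomly mask a subset of text tokens.
        mask = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
        masked_ids = token_ids.masked_fill(mask, self.mask_id)

        # Concatenate text and visual tokens into one joint sequence.
        text = self.tok_emb(masked_ids)        # (B, Lt, D)
        image = self.img_proj(region_feats)    # (B, Lv, D)
        hidden = self.encoder(torch.cat([text, image], dim=1))

        # Predict the original ids only at masked text positions.
        logits = self.head(hidden[:, : token_ids.size(1)])
        if mask.any():
            return nn.functional.cross_entropy(logits[mask], token_ids[mask])
        return logits.sum() * 0  # no tokens were masked in this batch


# Usage with toy inputs: 2 captions of 16 tokens, 10 detected regions each.
model = CrossModalMLM()
loss = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 10, 2048))
```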
As transformers evolve, pre-trained models have advanced at a breakneck pace in recent years. They h...
Pretrained models have produced great success in both Computer Vision (CV) and Natural Language Proc...
Recent work in vision-and-language pretraining has investigated supervised signals from object detec...
The wide adoption of self-attention (i.e., the transformer model) and BERT-like training principles ...
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and ...
With recent progress in joint modeling of visual and textual representations, Vision-Language Pretra...
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correc...
The multimedia community has shown a significant interest in perceiving and representing the physica...
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its trans...