Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address the challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, CPT en...
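The color-based prompting idea described above can be illustrated with a short sketch: candidate regions are overlaid with distinct translucent colors, the query is cast as a fill-in-the-blank template, and the region whose color word best fills the blank is selected. The region proposals, the "[MASK] color" template, and the `score_mask_token` callback (a stand-in for the masked-prediction head of a VL-PTM) are illustrative assumptions, not the exact CPT implementation.

```python
# Minimal sketch of color-based co-referential markers for visual grounding.
# Template wording, colors, and the scoring callback are illustrative only.
from typing import Callable, Dict, List, Tuple
from PIL import Image, ImageDraw

# color name -> RGBA overlay used as a co-referential marker
COLOR_MARKERS: Dict[str, Tuple[int, int, int, int]] = {
    "red":   (255, 0, 0, 96),
    "green": (0, 255, 0, 96),
    "blue":  (0, 0, 255, 96),
}

def paint_regions(image: Image.Image,
                  boxes: List[Tuple[int, int, int, int]]) -> Tuple[Image.Image, List[str]]:
    """Overlay a distinct translucent color on each candidate region."""
    canvas = image.convert("RGBA")
    overlay = Image.new("RGBA", canvas.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    colors = list(COLOR_MARKERS)[: len(boxes)]
    for box, name in zip(boxes, colors):
        draw.rectangle(box, fill=COLOR_MARKERS[name])
    return Image.alpha_composite(canvas, overlay), colors

def ground_by_color(image: Image.Image,
                    boxes: List[Tuple[int, int, int, int]],
                    query: str,
                    score_mask_token: Callable[[Image.Image, str, str], float]) -> int:
    """Return the index of the box whose color word best fills the blank."""
    painted, colors = paint_regions(image, boxes)
    prompt = f"{query} is in [MASK] color."   # fill-in-the-blank template
    scores = [score_mask_token(painted, prompt, c) for c in colors]
    return max(range(len(scores)), key=scores.__getitem__)
```

The `score_mask_token` callback is where a VL-PTM's masked language modeling head would plug in, returning the probability of a given color word at the blank position; everything else is ordinary image manipulation.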
Since the rise of powerful large-scale pre-trained Vision-Language (VL) models, such as CLIP and ALI...
We present a new paradigm for fine-tuning large-scale vision-language pre-trained models on downstre...
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation betwe...
As the Transformer architecture evolves, pre-trained models have advanced at a breakneck pace in recent years. They h...
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its trans...
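As a concrete illustration of the transfer setting referred to above, the following sketch performs zero-shot classification with CLIP through the Hugging Face transformers interface; the checkpoint name, image path, and label prompts are placeholders chosen for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                   # placeholder path to any RGB image
labels = ["a photo of a cat", "a photo of a dog"]   # prompt-style class descriptions

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image       # image-text similarity scores
probs = logits.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```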
Large pre-trained vision-language models like CLIP have shown great potential in learning representa...
Pretrained models have achieved great success in both Computer Vision (CV) and Natural Language Proc...
Vision-language pre-training (VLP) has shown impressive performance on a wide range of cross-modal t...
The current modus operandi in adapting pre-trained models involves updating all the backbone paramet...
Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeli...
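To make the masked language modeling objective mentioned above concrete, a bare-bones sketch of the text-side masking step is given below; the 15% masking rate and the special token ids follow the usual BERT-style convention and are illustrative only (the cross-modal part, conditioning on the image, happens inside the model and is not shown).

```python
# Bare-bones sketch of masking caption tokens for an MLM-style objective.
import torch

MASK_ID, IGNORE = 103, -100   # illustrative special token / ignore ids

def mask_caption(token_ids: torch.Tensor, p: float = 0.15):
    """Randomly mask caption tokens; labels keep only the masked positions."""
    ids = token_ids.clone()
    labels = token_ids.clone()
    masked = torch.rand(ids.shape) < p
    ids[masked] = MASK_ID          # the model reconstructs these, conditioned
    labels[~masked] = IGNORE       # on the image and the unmasked text
    return ids, labels

ids, labels = mask_caption(torch.randint(1000, 2000, (1, 12)))
```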
With recent progress in joint modeling of visual and textual representations, Vision-Language Pretra...
Vision-language pre-training (VLP) methods have blossomed recently, and their crucial goal is to joint...
Prompt tuning has become a new paradigm for model tuning and has demonstrated success in natural ...
In recent years, prompt tuning has proven effective in adapting pre-trained vision-language models t...
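The following toy sketch illustrates the general soft-prompt recipe such methods build on: the backbone is kept frozen and only a few continuous context vectors, prepended to the class-name embeddings, are optimized. The tiny Transformer encoder, the dimensions, and the mean pooling are stand-ins for a real VL model's text tower, not any particular method's implementation.

```python
# Toy sketch of soft prompt tuning: only the learnable context vectors train.
import torch
import torch.nn as nn

class SoftPromptClassifier(nn.Module):
    def __init__(self, text_encoder: nn.Module, class_embeds: torch.Tensor,
                 n_ctx: int = 4, dim: int = 512):
        super().__init__()
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():           # freeze the backbone
            p.requires_grad_(False)
        self.register_buffer("class_embeds", class_embeds)  # [n_cls, n_tok, dim]
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable prompt

    def class_features(self) -> torch.Tensor:
        n_cls = self.class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        tokens = torch.cat([ctx, self.class_embeds], dim=1)  # [prompt; class name]
        return self.text_encoder(tokens).mean(dim=1)         # pooled text features

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        text_feats = nn.functional.normalize(self.class_features(), dim=-1)
        image_feats = nn.functional.normalize(image_feats, dim=-1)
        return image_feats @ text_feats.t()                   # cosine-similarity logits

# Usage with a stand-in frozen encoder: only `model.ctx` receives gradients.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2)
model = SoftPromptClassifier(encoder, class_embeds=torch.randn(3, 6, 512))
logits = model(torch.randn(8, 512))                           # 8 images x 3 classes
loss = nn.functional.cross_entropy(logits, torch.randint(0, 3, (8,)))
loss.backward()
```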
Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevail...