We investigate the efficacy of visual prompting to adapt large-scale models in vision. Following the recent approach from prompt tuning and adversarial reprogramming, we learn a single image perturbation such that a frozen model prompted with this perturbation performs a new task. Through comprehensive experiments, we demonstrate that visual prompting is particularly effective for CLIP and robust to distribution shift, achieving performance competitive with standard linear probes. We further analyze properties of the downstream dataset, prompt design, and output transformation in regard to adaptation performance. The surprising effectiveness of visual prompting provides a new perspective on adapting pre-trained models in vision. Code is ava...
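The abstract above describes learning a single input-space perturbation that is added to every image so a frozen model performs a new task. Below is a minimal, hedged sketch of that idea for CLIP, using the open-source `clip` package; the border-style prompt geometry, the "cat"/"dog" label set, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of visual prompting: optimize only a shared pixel perturbation
# (a learned border around each image) while CLIP stays frozen.
# Assumptions: prompt width, learning rate, and class names are made up here.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()                 # keep everything in fp32 for simplicity
for p in model.parameters():          # freeze the backbone entirely
    p.requires_grad_(False)

pad, image_size = 30, 224             # assumed prompt width / input resolution
prompt = torch.zeros(1, 3, image_size, image_size, device=device, requires_grad=True)
mask = torch.zeros_like(prompt)       # restrict the prompt to a border region
mask[:, :, :pad, :] = 1; mask[:, :, -pad:, :] = 1
mask[:, :, :, :pad] = 1; mask[:, :, :, -pad:] = 1

class_names = ["cat", "dog"]          # hypothetical downstream labels
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

optimizer = torch.optim.SGD([prompt], lr=40.0, momentum=0.9)

def train_step(images, labels):
    """images: CLIP-preprocessed tensor (B, 3, 224, 224); labels: (B,) class ids.
    Adds the masked prompt to every image and updates only the prompt."""
    prompted = images + prompt * mask
    img_feat = model.encode_image(prompted)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feat @ text_feat.t()   # CLIP-style scaled cosine logits
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design choice this sketch illustrates is that the only trainable parameters are the prompt pixels; the output mapping comes for free from CLIP's text encoder, which is one reason the paper finds visual prompting particularly effective for CLIP.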
The size of vision models has grown exponentially over the last few years, especially after the emer...
Recently, CLIP-based approaches have exhibited remarkable performance on generalization and few-shot...
Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual repr...
We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can ...
We present a new paradigm for fine-tuning large-scale vision-language pre-trained models on downstre...
Large pre-trained vision-language models like CLIP have shown great potential in learning representa...
We present prompt distribution learning for effectively adapting a pre-trained vision-language model...
Since the rise of powerful large-scale pre-trained Vision-Language (VL) models, such as CLIP and ALI...
In recent years, prompt tuning has proven effective in adapting pre-trained vision-language models t...
The current modus operandi in adapting pre-trained models involves updating all the backbone paramet...
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-te...
Recent advances in pre-training vision-language models like CLIP have shown great potential in learn...
With the increasing attention to large vision-language models such as CLIP, there has been a signifi...
Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural ...