Recent advances in pre-training vision-language models like CLIP have shown great potential in learning transferable visual representations. Nonetheless, for downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness when the text descriptions used for retrieval-based inference are inaccurate (the challenge for the zero-shot protocol), or 2) a break in the well-established vision-language alignment (the challenge for linear probing). To address these issues, we propose Decomposed Feature Prompting (DeFo). DeFo leverages a flexible number of learnable embeddings as textual input while maintaining the vision-language dual-model architecture, which enables the model to learn decomposed visual features with the help of feat...
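The mechanism sketched in the abstract above (learnable textual queries in place of handcrafted prompts, frozen dual towers, a linear head over query-image similarities) can be illustrated with a short, hedged sketch. The snippet below assumes PyTorch and generic frozen CLIP-style encoders passed in by the caller; the class name, argument names, and default sizes are illustrative assumptions, not the released DeFo implementation.

```python
import torch
import torch.nn as nn

class DecomposedFeaturePrompting(nn.Module):
    """Minimal sketch of the DeFo idea described above (not the authors' code).

    A fixed number of learnable query embeddings stands in for handcrafted
    text prompts. Both encoders stay frozen, so the pretrained vision-language
    alignment is preserved; only the queries and a small linear head are tuned.
    """

    def __init__(self, image_encoder, text_encoder, num_queries=64,
                 embed_dim=512, num_classes=100):
        super().__init__()
        # `image_encoder` / `text_encoder` are assumed to be frozen CLIP-style
        # towers mapping images / embedding vectors into a shared space of
        # dimension `embed_dim` (an assumption of this sketch).
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Learnable textual queries replace class-name prompts ("a photo of a ...").
        self.queries = nn.Parameter(0.02 * torch.randn(num_queries, embed_dim))
        # Linear head maps the query-similarity vector to class logits.
        self.head = nn.Linear(num_queries, num_classes)

    def forward(self, images):
        with torch.no_grad():                    # image tower stays frozen
            img = self.image_encoder(images)     # (B, D)
        # Text tower is also assumed frozen (requires_grad=False on its
        # parameters); gradients still flow into the query embeddings.
        txt = self.text_encoder(self.queries)    # (Q, D)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        sims = img @ txt.t()                     # (B, Q) per-query visual evidence
        return self.head(sims)                   # (B, num_classes)


# Toy usage with stand-in encoders (illustrative only):
img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
txt_enc = nn.Identity()   # queries already live in the shared space here
model = DecomposedFeaturePrompting(img_enc, txt_enc, num_queries=16, num_classes=10)
logits = model(torch.randn(4, 3, 32, 32))       # -> shape (4, 10)
```

Because only the query embeddings and the linear head receive gradients, the pretrained alignment inside the two towers is left untouched, which is the contrast the abstract draws against linear probing.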
Thesis (Master's)--University of Washington, 2023. In many real-world applications, the frequency dist...
We present a new paradigm for fine-tuning large-scale vision-language pre-trained models on downstre...
The pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language ...
Large pre-trained vision-language models like CLIP have shown great potential in learning representa...
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its trans...
Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual repr...
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified em...
Although massive pre-trained vision-language models like CLIP show impressive generalization capabil...
With the increasing attention to large vision-language models such as CLIP, there has been a signifi...
In recent years, prompt tuning has proven effective in adapting pre-trained vision-language models t...
Image-text contrastive models such as CLIP are useful for a variety of downstream applications inclu...
Latent image representations arising from vision-language models have proved immensely useful for a ...
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of t...
The extent to which text-only language models (LMs) learn to represent the physical, non-linguistic ...