Recent studies have shown that CLIP has achieved remarkable success in zero-shot inference, while its fine-tuning performance is not satisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact on fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and recent works that use CLIP as prediction targets in Masked Image ...
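As a hedged illustration of the kind of hyper-parameter choices such a study varies, the sketch below fine-tunes a CLIP image encoder for classification in PyTorch, assuming the open_clip package; the specific values (learning rate, weight decay, label smoothing) and the 1000-class head are illustrative assumptions, not the exact recipe reported in the abstract above.

    # Illustrative full fine-tuning of a CLIP image encoder for classification.
    # Hyper-parameter values below are assumptions for illustration only.
    import torch
    import torch.nn as nn
    import open_clip

    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-16", pretrained="openai")
    visual = model.visual                      # image encoder only
    head = nn.Linear(visual.output_dim, 1000)  # e.g. an ImageNet-1k head

    params = list(visual.parameters()) + list(head.parameters())
    optimizer = torch.optim.AdamW(
        params,
        lr=1e-5,           # a small learning rate is typically a key choice
        weight_decay=0.1,  # weight decay is another commonly examined knob
    )
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    def train_step(images, labels):
        logits = head(visual(images))
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()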
The current modus operandi in adapting pre-trained models involves updating all the backbone paramet...
Recent advances in pre-training vision-language models like CLIP have shown great potential in learn...
We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset ...
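A minimal sketch of the bias-only tuning idea described above, assuming a standard PyTorch module; selecting parameters whose names end in "bias" is an illustrative convention and stands in for whatever subset of bias terms is actually tuned.

    # Minimal BitFit-style sketch: train only bias terms, freeze everything else.
    import torch
    import torch.nn as nn

    def bitfit_parameters(model: nn.Module):
        """Freeze all weights and return only the bias parameters for the optimizer."""
        trainable = []
        for name, param in model.named_parameters():
            if name.endswith("bias"):
                param.requires_grad = True
                trainable.append(param)
            else:
                param.requires_grad = False
        return trainable

    # Usage: the optimizer only sees the (tiny) set of bias terms.
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
    optimizer = torch.optim.AdamW(bitfit_parameters(model), lr=1e-3)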
Existing fine-tuning methods either tune all parameters of the pre-trained model (full fine-tuning),...
Large vision-language representation learning models like CLIP have demonstrated impressive performa...
Large pre-trained models such as CLIP or ALIGN offer consistent accuracy across a range of data dist...
The emergence of vision-language models (VLMs), such as CLIP, has spurred a significant research eff...
The Contrastive Language-Image Pre-training (CLIP) Model is a recently proposed large-scale pre-trai...
As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many de...
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary...
In recent years, convolutional neural networks have achieved state-of-the-art performance in a numbe...
The impressive performance of deep learning architectures is associated with a massive increase of mode...
We present RECLIP (Resource-efficient CLIP), a simple method that minimizes computational resource f...
Although massive pre-trained vision-language models like CLIP show impressive generalization capabil...
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified em...