We present RECLIP (Resource-efficient CLIP), a simple method that minimizes the computational resource footprint of CLIP (Contrastive Language-Image Pretraining). Inspired by the notion of coarse-to-fine in computer vision, we leverage small images to learn from large-scale language supervision efficiently, and finetune the model with high-resolution data at the end. Since the complexity of the vision transformer depends heavily on input image size, our approach significantly reduces the training resource requirements both in theory and in practice. Using the same batch size and number of training epochs, RECLIP achieves highly competitive zero-shot classification and image-text retrieval accuracy with 6 to 8× less computational resources and 7 to 9× fewer FLOPs than the baselines.
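
To make the resolution argument concrete, the short Python sketch below estimates how a ViT encoder's per-image cost scales with input resolution. The patch size, model width, depth, and the 64 px pretraining / 224 px finetuning resolutions used here are illustrative assumptions for this sketch, not the exact RECLIP configuration.

def vit_cost(image_size: int, patch: int = 16, width: int = 768, depth: int = 12) -> float:
    """Coarse per-image multiply-accumulate count for a ViT encoder.

    Linear terms (QKV, output projection, 4x MLP) scale with the number of
    patch tokens; self-attention (QK^T and attention-weighted values) scales
    with the square of the token count.
    """
    tokens = (image_size // patch) ** 2
    linear = 12 * tokens * width ** 2      # 3 (QKV) + 1 (proj) + 8 (4x MLP) matmuls
    attention = 2 * tokens ** 2 * width    # QK^T plus attention-weighted V
    return depth * (linear + attention)

low_res, high_res = 64, 224                # assumed coarse-pretrain vs. high-res finetune sizes
print(f"patch tokens at {low_res}px:  {(low_res // 16) ** 2}")
print(f"patch tokens at {high_res}px: {(high_res // 16) ** 2}")
print(f"approximate cost ratio (high/low): {vit_cost(high_res) / vit_cost(low_res):.1f}x")

Because the attention term grows quadratically in the token count while the linear terms grow only linearly, shrinking the input image shrinks the dominant encoder cost by roughly an order of magnitude under these assumptions; this is the effect the coarse-to-fine schedule exploits before the brief high-resolution finetuning stage.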