Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP's contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However,...
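To make the region-scoring idea in the first abstract concrete, the following is a minimal sketch of CLIP-based proposal scoring in that spirit: each proposal box is isolated, either by cropping it out or by blurring everything outside it, and the isolated regions are ranked against the referring expression with an off-the-shelf CLIP model. It assumes the open-source OpenAI CLIP package and PIL; the helper names (isolate_crop, isolate_blur, score_proposals) and details such as the blur radius are illustrative assumptions, not the ReCLIP authors' code.

    # Minimal sketch (not the ReCLIP implementation): score object proposals
    # against a referring expression with off-the-shelf CLIP. Each proposal is
    # isolated either by cropping the box or by blurring the rest of the image.
    import torch
    import clip                                  # OpenAI CLIP package
    from PIL import Image, ImageFilter

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def isolate_crop(image, box):
        # Keep only the proposal region (box = (x0, y0, x1, y1)).
        return image.crop(box)

    def isolate_blur(image, box, radius=20):     # radius is an assumed value
        # Blur the whole image, then paste the sharp proposal back in place.
        blurred = image.filter(ImageFilter.GaussianBlur(radius))
        blurred.paste(image.crop(box), box[:2])
        return blurred

    @torch.no_grad()
    def score_proposals(image, boxes, expression, isolate=isolate_crop):
        # Encode each isolated proposal and the expression; return cosine
        # similarities, one score per proposal box.
        regions = torch.stack([preprocess(isolate(image, b)) for b in boxes]).to(device)
        text = clip.tokenize([expression]).to(device)
        img_feats = model.encode_image(regions)
        txt_feats = model.encode_text(text)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
        return (img_feats @ txt_feats.T).squeeze(-1)

    # Usage: pick the proposal most similar to the expression.
    # image = Image.open("street.jpg"); boxes = [(10, 20, 120, 200), (150, 40, 300, 260)]
    # best = boxes[score_proposals(image, boxes, "the man in the red shirt").argmax().item()]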
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks. In ...
Referring expression grounding is an important and challenging task in computer vision. To avoid the...
Referring Expression Generation (REG) algorithms, a core component of systems that generate text fro...
We present RECLIP (Resource-efficient CLIP), a simple method that minimizes computational resource f...
Semantic segmentation has a broad range of applications, but its real-world impact has been signific...
Large pre-trained vision-language models like CLIP have shown great potential in learning representa...
Referring Expression Comprehension (REC) is one of the most important tasks in visual reasoning that...
Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained visi...
With recent progress in joint modeling of visual and textual representations, Vision-Language Pretra...
Neural Referring Expression Generation (REG) models have shown promising results in generating expre...
Referring expression comprehension is the task of locating the image region described by a natural ...
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with p...
This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is ...
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified em...
We propose a margin-based loss for vision-language model pretraining that encourages gradient-based ...