Recently, contrastive language-image pre-training (e.g., CLIP) has demonstrated promising results on various downstream tasks. The pre-trained model can capture rich visual concepts for images by learning from large-scale text-image data. However, transferring the learned visual knowledge to open-vocabulary semantic segmentation remains under-explored. In this paper, we propose a CLIP-based model named SegCLIP for open-vocabulary segmentation in an annotation-free manner. SegCLIP achieves segmentation based on ViT, and its main idea is to gather patches with learnable centers into semantic regions through training on text-image pairs. The gathering operation can dynamically capture the semantic groups, which can ...
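The gathering idea described above can be pictured as a small cross-attention module: a set of learnable center embeddings attends over the ViT patch tokens and softly assigns patches to centers, yielding region-level features that can then be matched against text embeddings. The sketch below is only an illustrative assumption of such a mechanism, not the SegCLIP implementation; all names (PatchGather, num_centers, etc.) are hypothetical.

# Minimal sketch (assumed, not the authors' code) of gathering patches
# with learnable centers into semantic regions via cross-attention.
import torch
import torch.nn as nn


class PatchGather(nn.Module):
    def __init__(self, dim: int = 768, num_centers: int = 8):
        super().__init__()
        # Learnable semantic centers, refined during image-text training.
        self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, D) ViT patch embeddings.
        b = patch_tokens.size(0)
        q = self.to_q(self.centers).expand(b, -1, -1)          # (B, C, D)
        k = self.to_k(patch_tokens)                             # (B, N, D)
        v = self.to_v(patch_tokens)                             # (B, N, D)
        attn = torch.einsum("bcd,bnd->bcn", q, k) * self.scale  # (B, C, N)
        # Soft assignment of each patch to a center; taking the argmax over
        # centers at inference time yields a hard grouping into regions.
        assign = attn.softmax(dim=1)                             # (B, C, N)
        region_feats = torch.einsum("bcn,bnd->bcd", assign, v)  # (B, C, D)
        return region_feats, assign


# Usage: region features could be pooled and contrasted with CLIP text
# embeddings under the usual image-text objective.
if __name__ == "__main__":
    gather = PatchGather(dim=768, num_centers=8)
    patches = torch.randn(2, 196, 768)  # e.g., 14x14 patches from a ViT-B/16
    regions, assignment = gather(patches)
    print(regions.shape, assignment.shape)  # (2, 8, 768) and (2, 8, 196)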
Grouping and recognition are important components of visual scene understanding, e.g., for object de...
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its trans...
Recently, CLIP-based approaches have exhibited remarkable performance on generalization and few-shot...
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's...
The emergence of CLIP has opened the way for open-world image perception. The zero-shot classificati...
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary...
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to te...
To bridge the gap between supervised semantic segmentation and real-world applications that require...
When trained at a sufficient scale, self-supervised learning has exhibited a notable ability to solv...
Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing objects from...
Fully supervised semantic segmentation learns from dense masks, which requires heavy annotation cost...
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The ...
CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabular...
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose ...
We design an open-vocabulary image segmentation model to organize an image into meaningful regions i...