Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component ana...
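The abstract above is cut off before the method details, so purely as illustration, here is a minimal sketch of the general idea it describes: embed part-of-speech-grouped vocabulary with CLIP's text encoder and fit a per-group component analysis to obtain attribute-specific directions in the joint space. The checkpoint name, the word lists, and the use of plain PCA (standing in for whatever "appropriate component analysis" the truncated abstract actually formulates) are all assumptions, not taken from the paper.

```python
# Hedged illustration, NOT the paper's actual method: plain PCA per
# part-of-speech group stands in for the truncated "component analysis".
# Assumes the Hugging Face checkpoint "openai/clip-vit-base-patch32".
import torch
from sklearn.decomposition import PCA
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical word lists grouping text by part of speech:
# nouns ~ objects, adjectives ~ appearance.
pos_words = {
    "noun": ["dog", "car", "tree", "house", "person"],
    "adjective": ["red", "shiny", "old", "fluffy", "dark"],
}

subspaces = {}
for pos, words in pos_words.items():
    inputs = processor(text=words, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)      # (N, 512) text embeddings
    feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize, as CLIP does
    pca = PCA(n_components=3).fit(feats.numpy())       # directions of variation per group
    subspaces[pos] = pca.components_                   # (3, 512) basis per part of speech

# An image embedding could then be projected onto each basis to read off
# object-related and appearance-related components separately.
```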
Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual repr...
Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained visi...
Most Image Aesthetic Assessment (IAA) methods use a pretrained ImageNet classification model as a ba...
Recent advances in pre-training vision-language models like CLIP have shown great potential in learn...
Large pre-trained vision-language models like CLIP have shown great potential in learning representa...
The development of CLIP [Radford et al., 2021] has sparked a debate on whether language supervision ...
This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is ...
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified em...
Images can be described in terms of the objects they contain, or in terms of the types of scene or p...
Contrastive learning is a form of distance learning that aims to learn invariant features from two r...
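Since the entry above defines contrastive learning only in passing (and is truncated), here is a minimal sketch of the symmetric CLIP-style contrastive (InfoNCE) objective several of the listed papers build on. The function name, variable names, and temperature value are illustrative assumptions, not drawn from any specific paper in this list.

```python
# Minimal sketch of a symmetric CLIP-style contrastive (InfoNCE) loss.
# Assumes `img_emb` and `txt_emb` are L2-normalized (B, D) embeddings of
# B matched image-text pairs; all names and values here are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) pairwise similarities
    targets = torch.arange(len(logits))            # matched pairs lie on the diagonal
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2

# Example with random embeddings:
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_contrastive_loss(img, txt))
```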
Visually grounded speech systems learn from paired images and their spoken captions. Recently, there...
Paper accepted for presentation at the ViGIL 2021 workshop @NAACL. This version: added models to the...
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of t...
Tasks that require modeling of both language and visual information, such as image captioning, have be...
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its trans...