The extent to which text-only language models (LMs) learn to represent the physical, non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught to "understand" visual inputs when the models' parameters are updated on image captioning tasks. We test a stronger hypothesis: that the conceptual representations learned by text-only models are functionally equivalent (up to a linear transformation) to those learned by models trained on vision tasks. Specifically, we show that image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection. Using these projections to prompt the LM achieves competitive performance on captioning and visual question answering tasks.
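As a rough illustration of the setup described above, the sketch below shows how such a single linear projection could be trained: frozen image features are mapped to a few continuous prompt vectors in the LM's input-embedding space and prepended to the caption's token embeddings, with only the projection receiving gradients. This is a minimal sketch under stated assumptions (PyTorch; a Hugging Face-style frozen decoder that accepts `inputs_embeds` and `labels`; all names and dimensions are illustrative, not the paper's).

```python
import torch
import torch.nn as nn

class LinearPrompt(nn.Module):
    """Projects a frozen image encoder's pooled output to k soft-prompt
    vectors in the frozen LM's input-embedding space. Only this projection
    is trained (hypothetical module; dimensions are placeholders)."""
    def __init__(self, image_dim: int, lm_dim: int, num_prompt_tokens: int = 4):
        super().__init__()
        self.k = num_prompt_tokens
        self.lm_dim = lm_dim
        self.proj = nn.Linear(image_dim, lm_dim * num_prompt_tokens)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, image_dim) -> (batch, k, lm_dim)
        return self.proj(image_features).view(-1, self.k, self.lm_dim)


def caption_loss(lm, lm_embed, projector, image_features, caption_ids):
    """Loss for one training step, assuming `lm` is a frozen decoder that
    accepts `inputs_embeds`/`labels` (e.g. a GPT-style model) and `lm_embed`
    is its frozen token-embedding table."""
    prompts = projector(image_features)                  # (B, k, lm_dim)
    token_embeds = lm_embed(caption_ids)                 # (B, T, lm_dim)
    inputs = torch.cat([prompts, token_embeds], dim=1)   # prepend visual prompts
    ignore = torch.full(prompts.shape[:2], -100,         # mask prompt positions
                        dtype=torch.long, device=caption_ids.device)
    labels = torch.cat([ignore, caption_ids], dim=1)
    return lm(inputs_embeds=inputs, labels=labels).loss
```

In use, the image encoder and the LM would both stay frozen; an optimizer is constructed over `projector.parameters()` alone and stepped on this loss.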