This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a simple yet previously overlooked strategy -- employing a frozen transformer block from pre-trained LLMs as a constituent encoder layer to directly process visual tokens. Our work pushes the boundaries of leveraging LLMs for computer vision tasks, significantly departing from conventional practices that typically necessitate a multi-modal vision-language setup with associated language prompts, inputs, or outputs. We demonstrate that our approach consistently enhances performance across a diverse range of tasks,...
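The mechanism described in this abstract is concrete enough to sketch. Below is a minimal PyTorch illustration, assuming a ViT-style backbone: the class and parameter names (FrozenLLMBlockEncoder, proj_in, proj_out, vit_dim, llm_dim) are hypothetical, and nn.TransformerEncoderLayer merely stands in for a block extracted from an actual pre-trained LLM. This is a sketch of the stated idea, not the paper's implementation.

```python
# Hedged sketch: a frozen transformer block taken from a pre-trained LLM is
# inserted into a ViT-style visual encoder, with trainable linear projections
# bridging the ViT embedding width and the LLM hidden size. No text tokens,
# prompts, or language outputs are involved anywhere.
import torch
import torch.nn as nn


class FrozenLLMBlockEncoder(nn.Module):  # hypothetical name, not the paper's code
    def __init__(self, vit_blocks: nn.ModuleList, llm_block: nn.Module,
                 vit_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.vit_blocks = vit_blocks                  # trainable visual encoder layers
        self.proj_in = nn.Linear(vit_dim, llm_dim)    # trainable: ViT width -> LLM width
        self.llm_block = llm_block                    # one block from a pre-trained LLM
        self.proj_out = nn.Linear(llm_dim, vit_dim)   # trainable: LLM width -> ViT width
        # Freeze the LLM block; only the ViT layers and the two projections train.
        for p in self.llm_block.parameters():
            p.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_visual_tokens, vit_dim), e.g. patch embeddings.
        for blk in self.vit_blocks:
            tokens = blk(tokens)
        h = self.proj_in(tokens)
        h = self.llm_block(h)        # frozen LLM layer processes visual tokens directly
        return self.proj_out(h)


# Usage with stand-in modules (a real setup would extract llm_block from an
# actual LLM checkpoint, e.g. one decoder layer of LLaMA):
vit_blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    for _ in range(12))
llm_block = nn.TransformerEncoderLayer(d_model=4096, nhead=32, batch_first=True)
model = FrozenLLMBlockEncoder(vit_blocks, llm_block)
feats = model(torch.randn(2, 197, 768))   # 197 = 14*14 patches + [CLS]
```

The design choice worth noting is that gradients flow only through the ViT layers and the two projections, so the language-pretrained weights act as a fixed, reusable computation block rather than something fine-tuned for vision.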
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language represent...
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and ...
In recent years, joint text-image embeddings have significantly improved thanks to the development o...
The extent to which text-only language models (LMs) learn to represent the physical, non-linguistic ...
Pretrained models have produced great success in both Computer Vision (CV) and Natural Language Proc...
As transformers evolve, pre-trained models have advanced at a breakneck pace in recent years. They h...
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reas...
Multi-modal Large Language Models (MLLMs) have made significant strides in expanding the capabilitie...
The integration of visual inputs with large language models (LLMs) has led to remarkable advancement...
We show that Vision-Language Transformers can be learned without human labels (e.g. class labels, bo...
Scene text recognition (STR) enables computers to recognize and read the text in various real-world ...
Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natur...
Previous models for vision-to-language generation tasks usually pretrain a visual encoder...
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and l...
Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevail...