Pre-trained vision-language (VL) models have risen to prominence in recent years, achieving state-of-the-art performance on tasks such as visual question answering, image captioning, zero-shot tasks, text-to-image synthesis, and many others. However, these models are large and require large amounts of training data, which can hinder their application in resource-limited settings. This paper proposes CAPIT (Cross Attention on Pre-trained Image and Text models), a novel architecture built on top of frozen pre-trained unimodal encoders that transfers knowledge from the pre-trained models through cross-attentional transformers. CAPIT is trained with a simple supervised task that learns to predict the correspondence between image-text pairs. It is then tested...
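To make the described design concrete, the sketch below shows one possible reading of it in PyTorch: text features from a frozen text encoder attend to image features from a frozen image encoder through a small stack of cross-attention blocks, and a binary head predicts whether the pair corresponds. All module names, dimensions, layer counts, and the pooling choice are hypothetical illustrations under that assumption, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionMatcher(nn.Module):
    """Minimal sketch: cross-attention over frozen unimodal features.

    Assumed layout (illustrative only): text tokens act as queries over
    image patch features; the fused text tokens are pooled and fed to a
    binary head scoring image-text correspondence.
    """

    def __init__(self, dim=512, num_heads=8, num_layers=2):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))
        self.match_head = nn.Linear(dim, 1)  # logit: matched vs. mismatched pair

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, dim) from a frozen pre-trained text encoder
        # image_feats: (B, P, dim) from a frozen pre-trained image encoder
        x = text_feats
        for attn, norm in zip(self.attn_layers, self.norms):
            attended, _ = attn(query=x, key=image_feats, value=image_feats)
            x = norm(x + attended)  # residual cross-attention block
        pooled = x.mean(dim=1)      # pool fused text tokens
        return self.match_head(pooled)


# Usage with random stand-ins for the frozen encoders' outputs.
if __name__ == "__main__":
    model = CrossAttentionMatcher()
    text = torch.randn(4, 16, 512)    # 4 captions, 16 tokens each
    image = torch.randn(4, 49, 512)   # 4 images, 49 patches each
    labels = torch.tensor([[1.], [0.], [1.], [0.]])  # paired / unpaired
    loss = nn.functional.binary_cross_entropy_with_logits(model(text, image), labels)
    print(loss.item())
```

Only the cross-attention stack and the matching head carry trainable parameters here; the unimodal encoders stay frozen, which is what keeps the approach lightweight relative to end-to-end vision-language pre-training.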
As transformers evolve, pre-trained models have advanced at a breakneck pace in recent years. They h...
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an importa...
Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regre...
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and l...
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that...
Large pre-trained vision-language models like CLIP have shown great potential in learning representa...
Large-scale pretrained foundation models have been an emerging paradigm for building artificial inte...
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its trans...
With the burgeoning amount of data of image-text pairs and diversity of Vision-and-Language (V&L) ta...
Current language models have been criticised for learning language from text alone without connectio...
Recent work has shown that self-supervised pre-training leads to improvements over supervised learni...
The large adoption of the self-attention (i.e. transformer model) and BERT-like training principles ...
We show that Vision-Language Transformers can be learned without human labels (e.g. class labels, bo...
CLIP proved that aligning visual and language spaces is key to solving many vision tasks without exp...
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correc...