Vision-language pre-training (VLP) on large-scale datasets has shown superior performance on various downstream tasks. A complete and fair benchmark (i.e., including large-scale pre-training datasets and diverse downstream tasks) is essential for VLP. While there are plenty of benchmarks built on English corpora, building a rich benchmark for VLP in other languages, such as Chinese, remains a critical problem. To this end, we build a large-scale Chinese cross-modal benchmark called Zero for the research community to fairly compare VLP models. We release two pre-training datasets and five fine-tuning datasets for downstream tasks. Alongside these datasets, we propose a novel pre-training framework of pre-Ranking + Ranking for cross-modal learning. Specifically...
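The abstract above cuts off before describing the pre-Ranking + Ranking framework in detail, so the following is only a minimal sketch of how such a two-stage setup is commonly structured: a dual encoder scores all image-text pairs cheaply with a contrastive objective (pre-ranking), and a lightweight fusion head re-scores candidate pairs (ranking). All module names, feature dimensions, and the `RankingHead` design are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Pre-ranking stage: independent image and text towers projected to a shared space."""
    def __init__(self, dim=256):
        super().__init__()
        self.image_proj = nn.Linear(2048, dim)  # assumes pooled visual features of size 2048
        self.text_proj = nn.Linear(768, dim)    # assumes pooled text features of size 768

    def forward(self, image_feat, text_feat):
        img = F.normalize(self.image_proj(image_feat), dim=-1)
        txt = F.normalize(self.text_proj(text_feat), dim=-1)
        return img, txt

class RankingHead(nn.Module):
    """Ranking stage: fuses an (image, text) embedding pair and predicts a match score."""
    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_emb, txt_emb):
        return self.scorer(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss over an in-batch image-text similarity matrix."""
    labels = torch.arange(sim.size(0), device=sim.device)
    logits = sim / temperature
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Usage with random features standing in for real image/text encoders.
dual, ranker = DualEncoder(), RankingHead()
image_feat, text_feat = torch.randn(8, 2048), torch.randn(8, 768)
img_emb, txt_emb = dual(image_feat, text_feat)
pre_ranking_loss = contrastive_loss(img_emb @ txt_emb.t())  # stage 1: contrastive pre-ranking
match_scores = ranker(img_emb, txt_emb)                     # stage 2: re-score candidate pairs
```

In practice the ranking stage would typically be trained on hard negatives mined by the pre-ranking stage; that detail is omitted here for brevity.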
Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress...
Large-scale pretrained foundation models have been an emerging paradigm for building artificial inte...
With the burgeoning amount of data of image-text pairs and diversity of Vision-and-Language (V&L) ta...
Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream ta...
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of c...
Vision language pre-training aims to learn alignments between vision and language from a large amoun...
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and l...
In the past few years, the emergence of pre-training models has brought uni-modal fields such as com...
As transformer evolves, pre-trained models have advanced at a breakneck pace in recent years. They h...
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language represent...
Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models t...
Pre-trained vision language models (VL) have seen a rise in recent years, achieving state-of-the-art...
Pretrained models have produced great success in both Computer Vision (CV) and Natural Language Proc...
Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regre...
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correc...