Vision and language are the primary modalities of human perception and learning. Recent years have witnessed rapid development of methods that connect vision and language. Current deep learning methods are data-hungry, so pre-training on large-scale data helps warm up the model and yields better fine-tuning results on downstream tasks. However, pre-training frameworks that exploit the power of multi-modality remain underexplored. Specifically, the following questions remain: Could we build large pre-trained models that understand the interactions and alignments between modalities? Could language and vision help the understanding of each other? Could we combine the current diverse methods for vision pre-training and languag...
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that...
Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeli...
Developing intelligent agents that can perceive and understand the rich visual world around us has b...
In the past few years, the emergence of pre-training models has brought uni-modal fields such as com...
With the burgeoning amount of image-text pair data and the diversity of Vision-and-Language (V&L) ta...
As transformers evolve, pre-trained models have advanced at a breakneck pace in recent years. They h...
Pretrained models have produced great success in both Computer Vision (CV) and Natural Language Proc...
The world around us involves multiple modalities -- we see objects, feel texture, hear sounds, smell...
Most human language understanding is grounded in perception. There is thus growing interest in combi...
Enabling computers to demonstrate a proficient understanding of the physical world is an exceedingly...
Artificial Intelligence (AI) has transformed the way we interact with technology, e.g., chatbots, voi...
The multimedia community has shown a significant interest in perceiving and representing the physica...
Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks...
Using deep learning, computer vision now rivals people at object recognition and detection, opening ...
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correc...