Referring expression grounding is an important and challenging task in computer vision. To avoid the laborious annotation required by conventional referring grounding, unpaired referring grounding has been introduced, where the training data contains only images and queries without correspondences. The few existing solutions to unpaired referring grounding remain preliminary, owing to the difficulty of learning image-text matching and the lack of top-down guidance with unpaired data. In this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges. In particular, we design a query-aware attention map (QAM) module that introduces a top-down perspective by generating query-specific visual attention...
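The query-specific visual attention described above can be sketched, under assumed shapes and with hypothetical names (this is an illustrative reduction, not the paper's actual QAM module), as a dot-product similarity between a query embedding and a grid of visual features, followed by a spatial softmax:

```python
import numpy as np

def query_aware_attention(visual_feats, query_emb):
    """Toy query-specific spatial attention map.

    visual_feats: (H, W, D) array of visual features.
    query_emb:    (D,) query embedding.
    Returns an (H, W) map of softmax-normalized similarity scores.
    """
    scores = visual_feats @ query_emb        # (H, W) dot-product similarity
    flat = scores.reshape(-1)
    flat = flat - flat.max()                 # subtract max for numerical stability
    weights = np.exp(flat) / np.exp(flat).sum()
    return weights.reshape(scores.shape)

# Toy example: a 2x2 feature grid with 3-dim features.
feats = np.array([[[1., 0., 0.], [0., 1., 0.]],
                  [[0., 0., 1.], [1., 1., 0.]]])
query = np.array([1., 0., 0.])
attn = query_aware_attention(feats, query)
# The map sums to 1 and peaks at cells whose features align with the query.
```

The softmax makes the map a distribution over spatial locations, so regions matching the query receive proportionally more attention; a real module would learn the projections producing `visual_feats` and `query_emb` rather than take them as given.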
Given a textual phrase and an image, the visual grounding problem is the task of locating the conten...
Previous vision-language pre-training models mainly construct multi-modal inputs with tokens and obj...
Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task, which locates the targ...
© 2019 Association for Computational Linguistics. Grounding referring expressions to objects in an e...
Recently, the cross-modal pre-training task has been a hotspot because of its wide application in va...
Visual Grounding (VG) is a task of locating a specific object in an image semantically matching a gi...
The large adoption of the self-attention (i.e. transformer model) and BERT-like training principles ...
We propose a margin-based loss for vision-language model pretraining that encourages gradient-based ...
Visual grounding, i.e., localizing objects in images according to natural language queries, is an im...
Cross-modal attention mechanisms have been widely applied to the image-text matching task and have a...
Visual grounding is a ubiquitous building block in many vision-language tasks and yet remains challe...
In this paper, we are tackling the weakly-supervised referring expression grounding task, for the lo...
In this paper, we introduce a contextual grounding approach that captures the context in correspondi...
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correc...
Despite recent progress towards scaling up multimodal vision-language models, these models are still...