We propose a margin-based loss for vision-language model pretraining that encourages gradient-based explanations to be consistent with region-level annotations. We refer to this objective as Attention Mask Consistency (AMC) and show that it yields superior visual grounding performance compared to models that instead use region-level annotations to explicitly train an object detector such as Faster R-CNN. AMC works by encouraging gradient-based explanation masks to concentrate their attention scores within annotated regions of interest, for images that carry such annotations. In particular, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.59% in t...
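The following is a minimal sketch of what such a margin-based consistency loss could look like, assuming a PyTorch setting. The function name `amc_margin_loss`, the tensor layout, and the exact hinge formulation (inside-region peak vs. outside-region peak) are illustrative assumptions, not the paper's official implementation.

```python
# Hypothetical sketch of a margin-based attention-mask-consistency loss.
# `heatmap` is a gradient-based explanation map (e.g. GradCAM-style) and
# `mask` a binary region-of-interest annotation; both are assumptions
# about shapes and semantics, not the authors' released code.
import torch
import torch.nn.functional as F


def amc_margin_loss(heatmap: torch.Tensor,
                    mask: torch.Tensor,
                    margin: float = 0.2) -> torch.Tensor:
    """Encourage explanation scores inside the annotated region to
    exceed the strongest score outside it by at least `margin`.

    heatmap: (B, H, W) non-negative attention/explanation scores.
    mask:    (B, H, W) binary {0, 1} region annotations; each sample is
             assumed to contain at least one positive pixel.
    """
    # Normalize each heatmap to [0, 1] so the margin is scale-free.
    peak = heatmap.flatten(1).max(dim=1).values.view(-1, 1, 1)
    heatmap = heatmap / (peak + 1e-8)

    inside = mask > 0.5
    # Strongest activation inside vs. outside the annotated region.
    max_in = heatmap.masked_fill(~inside, float("-inf")).flatten(1).max(dim=1).values
    max_out = heatmap.masked_fill(inside, float("-inf")).flatten(1).max(dim=1).values

    # Hinge: penalize whenever the outside peak comes within `margin`
    # of the inside peak, i.e. the explanation leaks outside the region.
    return F.relu(max_out - max_in + margin).mean()


if __name__ == "__main__":
    # Toy usage: random heatmaps and sparse random masks.
    B, H, W = 4, 7, 7
    heatmap = torch.rand(B, H, W)
    mask = (torch.rand(B, H, W) > 0.7).float()
    print(amc_margin_loss(heatmap, mask))
```

In this sketch the loss is added on top of the standard vision-language pretraining objectives and is only computed for images that actually carry region annotations; samples without annotations would simply be excluded from the batch term.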
Most models tasked to ground referential utterances in 2D and 3D scenes learn to select the referred...
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e....
Existing visual explanation generating agents learn to fluently justify a class prediction. Conseque...
Visual Grounding (VG) is a task of locating a specific object in an image semantically matching a gi...
We present a new paradigm for fine-tuning large-scale vision-language pre-trained models on downstre...
Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task, which locates the targ...
Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural ...
This paper analyzes the predictions of image captioning models with attention mechanisms beyond visu...
Paper accepted for presentation at the ViGIL 2021 workshop @NAACL. This version: added models to the...
Vision-language pre-training (VLP) has shown impressive performance on a wide range of cross-modal t...
Reference-based line-art colorization is a challenging task in computer vision. The color, texture, ...
The large adoption of the self-attention (i.e. transformer model) and BERT-like training principles ...
Referring expression grounding is an important and challenging task in computer vision. To avoid the...
Large pre-trained vision-language models like CLIP have shown great potential in learning representa...
With recent progress in joint modeling of visual and textual representations, Vision-Language Pretra...