The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and lack the ability to generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without the necessity of pretraining or finetuning on any video datasets.
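As a rough illustration of the few-shot image-and-language pipeline the abstract describes, the sketch below composes frame-level captions from an image-language model into a temporally ordered prompt and asks a frozen language model to produce the video-level output. The helper callables `caption_frame` and `generate_text`, and the prompt layout, are hypothetical placeholders for illustration only, not the released VidIL code.

```python
# Minimal sketch of a few-shot video-to-text pipeline built from image and
# language models. `caption_frame` and `generate_text` are assumed callables
# (hypothetical placeholders), passed in so the sketch stays self-contained.

def video_to_text_few_shot(frames, few_shot_examples, task_instruction,
                           caption_frame, generate_text):
    """Compose frame captions into a prompt and query a frozen language model."""
    # 1. Describe each sampled frame with an image-language model.
    frame_captions = [caption_frame(frame) for frame in frames]

    # 2. Keep the captions in temporal order as the video representation.
    video_summary = " ".join(
        f"Frame {i + 1}: {caption}" for i, caption in enumerate(frame_captions)
    )

    # 3. Build a few-shot prompt: task instruction, in-context examples, query.
    prompt_parts = [task_instruction]
    for example_summary, example_target in few_shot_examples:
        prompt_parts.append(f"Video: {example_summary}\nOutput: {example_target}")
    prompt_parts.append(f"Video: {video_summary}\nOutput:")
    prompt = "\n\n".join(prompt_parts)

    # 4. Let the language model generate the target text (caption, answer, ...).
    return generate_text(prompt)
```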
The problem of describing images through natural language has gained importance in the computer vis...
Few-shot learning aims to train models that can be generalized to novel classes with only a few samp...
Unified vision-language frameworks have greatly advanced in recent years, most of which adopt an enc...
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-te...
Building models that can be rapidly adapted to novel tasks using only a handful of annotated example...
Training an effective video-and-language model intuitively requires multiple frames as model inputs....
Recent vision-language models are driven by large-scale pretrained models. How...
Linking natural language to visual data is an important topic at the intersection of Natural Languag...
Previous models for vision-to-language generation tasks usually pretrain a visual encoder...
This paper presents a novel approach for automatically generating image descriptions: visual detecto...
Recent video and language pretraining frameworks lack the ability to generate sentences. We present ...
This paper integrates techniques in natural language processing and computer vision to improve recog...
Generating natural language descriptions for visual data links computer vision and computational lin...
Contrastive language-image pretraining has shown great success in learning visual-textual joint repr...