Contrastive language-image pretraining has shown great success in learning visual-textual joint representations from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively extend such language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that directly adapts pretrained language-image models to video recognition, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and ...
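Since this abstract hinges on the cross-frame attention mechanism, a minimal sketch of what such a block might look like is given below. It is illustrative only, not the paper's exact design: the class name `CrossFrameAttention`, the use of one mean-pooled "message" token per frame, and all layer choices are assumptions made for the example.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Hypothetical sketch of cross-frame information exchange: each frame is
    summarized by one "message" token, the message tokens attend to each other
    across frames, and the result is broadcast back onto every patch token.
    Names and dimensions are illustrative assumptions, not the paper's design."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Projects each frame's mean-pooled features into a message token.
        self.to_message = nn.Linear(dim, dim)
        # Multi-head attention over the T message tokens (one per frame).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) = batch, frames, patch tokens per frame, channels
        B, T, N, D = x.shape
        msg = self.to_message(x.mean(dim=2))      # (B, T, D): one token per frame
        msg, _ = self.attn(msg, msg, msg)         # frames exchange information
        # Broadcast each frame's cross-frame message back to its patch tokens.
        return self.norm(x + msg.unsqueeze(2))    # (B, T, N, D)


block = CrossFrameAttention(dim=512)
video = torch.randn(2, 8, 49, 512)   # 2 clips, 8 frames, 7x7 patches, CLIP-width features
out = block(video)                   # same shape, frames now temporally mixed
print(out.shape)                     # torch.Size([2, 8, 49, 512])
```

Because the temporal mixing in this sketch runs over only T message tokens rather than all T×N patch tokens, the block stays lightweight and could plausibly be inserted between the spatial layers of a pretrained image encoder, consistent with the abstract's claim that the module is lightweight.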
Vision language pre-training aims to learn alignments between vision and language from a large amoun...
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with p...
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming ...
The pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language ...
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-te...
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an importa...
This work explores an efficient approach to establish a foundational video-text model for tasks incl...
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its trans...
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit pl...
Few-shot action recognition in videos is challenging for its lack of supervision and difficulty in g...
The goal of this work is to build flexible video-language models that can generalize to various vide...
Recent advances in pre-training vision-language models like CLIP have shown great potential in learn...
Video-and-language pre-training has shown promising results for learning generalizable representatio...
The last several years have witnessed remarkable progress in video-and-language (VidL) understanding...
Training an effective video-and-language model intuitively requires multiple frames as model inputs....