Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research. Along with the growth of computational capacity, we now have open-source vision-language pre-trained models in large scales of the model architecture and amount of data. In this study, we focus on transferring knowledge for video classification tasks. Conventional methods randomly initialize the linear classifier head for vision classification, but they leave the usage of the text encoder for downstream visual recognition tasks undiscovered. In this paper, we revise the role of the linear classifier and replace the classifier with the different knowledge from pre-trained model. We utilize the well-pretrai...
International audienceVision models trained on multimodal datasets can benefit from the wide availab...
Applying large scale pre-trained image-language model to video-language tasks has recently become a ...
Deep learning has been overwhelmingly successful in computer vision (CV), natural language processin...
Contrastive language-image pretraining has shown great success in learning visual-textual joint repr...
People typically learn through exposure to visual facts associated with linguistic descriptions. For...
Graduation date: 2017Access restricted to the OSU Community, at author's request, from December 13, ...
Classification, a \textit{supervised learning} problem, is a technique to categorize a given set of ...
Abstract—Unconstrained video recognition and Deep Convo-lution Network (DCN) are two active topics i...
With the burgeoning amount of data of image-text pairs and diversity of Vision-and-Language (V&L) ta...
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-te...
Pre-trained vision language models (VL) have seen a rise in recent years, achieving state-of-the-art...
We show that Vision-Language Transformers can be learned without human labels (e.g. class labels, bo...
International audienceFine-tuning pre-trained deep networks is a practical way of benefiting from th...
Vision models have improved in popularity and performance on many tasks since the emergence of large...
Understanding visual scenes is a crucial piece in many artificial intelligence applications ranging ...
International audienceVision models trained on multimodal datasets can benefit from the wide availab...
Applying large scale pre-trained image-language model to video-language tasks has recently become a ...
Deep learning has been overwhelmingly successful in computer vision (CV), natural language processin...
Contrastive language-image pretraining has shown great success in learning visual-textual joint repr...
People typically learn through exposure to visual facts associated with linguistic descriptions. For...
Graduation date: 2017Access restricted to the OSU Community, at author's request, from December 13, ...
Classification, a \textit{supervised learning} problem, is a technique to categorize a given set of ...
Abstract—Unconstrained video recognition and Deep Convo-lution Network (DCN) are two active topics i...
With the burgeoning amount of data of image-text pairs and diversity of Vision-and-Language (V&L) ta...
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-te...
Pre-trained vision language models (VL) have seen a rise in recent years, achieving state-of-the-art...
We show that Vision-Language Transformers can be learned without human labels (e.g. class labels, bo...
International audienceFine-tuning pre-trained deep networks is a practical way of benefiting from th...
Vision models have improved in popularity and performance on many tasks since the emergence of large...
Understanding visual scenes is a crucial piece in many artificial intelligence applications ranging ...
International audienceVision models trained on multimodal datasets can benefit from the wide availab...
Applying large scale pre-trained image-language model to video-language tasks has recently become a ...
Deep learning has been overwhelmingly successful in computer vision (CV), natural language processin...