Vision-language alignment learning for video-text retrieval has attracted considerable attention in recent years. Most existing methods either transfer the knowledge of an image-text pretraining model to the video-text retrieval task without fully exploring the multi-modal information of videos, or simply fuse multi-modal features in a brute-force manner without explicit guidance. In this paper, we integrate multi-modal information explicitly by tagging, and use the tags as anchors for better video-text alignment. Various pretrained experts are used to extract information from multiple modalities, including object, person, motion, audio, etc. To take full advantage of this information, we propose the TABLE (TAgging Before aL...
Pre-training on large scale unlabelled datasets has shown impressive performan...
In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on ...
The task of retrieving video content relevant to natural language queries play...
Video-language pre-training for text-based video retrieval tasks is vitally important. Previous pre-...
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal ...
Cross-modal retrieval has attracted widespread attention in many cross-media similarity search appli...
Systems that can find correspondences between multiple modalities, such as between speech and imag...
To solve video-and-language grounding tasks, the key is for the network to understand the connection...
Our experience of the world is multimodal - we see objects, hear sounds, and read texts to perceive ...
Text-video retrieval is a challenging task that aims to search relevant video contents based on natu...
We propose an unsupervised learning algorithm for automatically inferring the mappings between Engli...
Thesis (Ph.D.), University of Rochester, Department of Computer Science, 2016. Today we encounter la...
In recent years, tremendous success has been achieved in many computer vision tasks using deep learn...
We address the problem of text-based activity retrieval in video. Given a sentence describing an act...