The correlation between vision and text is essential for video moment retrieval (VMR); however, existing methods rely heavily on separately pre-trained feature extractors for visual and textual understanding. Without sufficient temporal boundary annotations, it is non-trivial to learn universal video-text alignments. In this work, we explore multi-modal correlations derived from large-scale image-text data to facilitate generalisable VMR. To address the limitations of image-text pre-trained models in capturing video changes, we propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments. Whilst existing VMR methods focus on building temporal-aware video feature...
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization (NLVL) or vid...
Recent work has shown that the integration of visual information into text-based models can substant...
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Com...
Our experience of the world is multimodal - we see objects, hear sounds, and read texts to perceive ...
Video-language pre-training for text-based video retrieval tasks is vitally important. Previous pre-...
This paper studies the problem of temporal moment localization in a long untrimmed video using natur...
Our objective in this work is video-text retrieval – in particular a joint embedding that enables ef...
In recent years, tremendous success has been achieved in many computer vision tasks using deep learn...
We address the problem of text-based activity retrieval in video. Given a sentence describing an act...
Video-Text pre-training aims at learning transferable representations from large-scale video-text pa...
Video moment retrieval pursues an efficient and generalized solution to identify the specific tempor...
Although large-scale video-language pre-training models, which usually build a global alignment betw...
Vision-language alignment learning for video-text retrieval has attracted a lot of attention in recent yea...