While recent large-scale video-language pre-training has made great progress in video question answering, the spatial modeling of video-language models is less fine-grained than that of image-language models; existing practices of temporal modeling also suffer from weak and noisy alignment between modalities. To learn fine-grained visual understanding, we decouple spatial-temporal modeling and propose a hybrid pipeline, Decoupled Spatial-Temporal Encoders, integrating an image-language and a video-language encoder. The former encodes spatial semantics from larger but sparsely sampled frames independently of time, while the latter models temporal dynamics at lower spatial but higher temporal resolution. To help the video-language model learn t...
The tremendous growth in video data, both on the internet and in real life, has encouraged the devel...
Video question-answering is a fundamental task in the field of video understanding. Although curren...
Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames...
Despite significant progress in video question answering (VideoQA), existing methods fall short of q...
Video question answering (QA) aims to understand the video scene and underlying plot by answering vi...
Video Question Answering (VideoQA) aims to answer natural language questions according to the given ...
Training an effective video-and-language model intuitively requires multiple frames as model inputs....
Recently, large-scale pre-trained language-image models like CLIP have shown extraordinary capabilit...
Video question answering is a challenging task that requires understanding jointly the language inpu...
Although large-scale video-language pre-training models, which usually build a global alignment betw...
To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in visi...
Large language models such as GPT-3 have demonstrated an impressive capability to adapt to new tasks...
Video question answering is the task of automatically answering questions about videos. Apart from d...
Understanding temporal dynamics of video is an essential aspect of learning better video representat...
A defining characteristic of natural vision is its ability to withstand a variety of input alteratio...