Movies provide us with a mass of visual content as well as engaging stories. Existing methods have shown that understanding movie stories from visual content alone remains a hard problem. In this paper, to answer questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content with the Static Word Memory module and the Dynamic Subtitle Memory module, respectively. Specifically, we first extract words and sentences from the training movie subtitles. The hierarchically formed movie representations learned by LMN then encode not only the correspondence between words and visual content inside frames, but also the temporal alignment between sentence...
Memory processes have undergone extensive investigation using various experimental methods. While wo...
Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired peopl...
Most recent progress in visual question answering is based on recurrent neural networks (R...
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both vide...
© 2017 ACM. Recently, a new type of video understanding task called Movie-Fill-in-the-Blank (MovieFI...
Discovering content and stories in movies is one of the most important concept...
This thesis explores a computer's ability to understand multimodal data where the correspondence bet...
To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in visi...
Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question A...
Visual narrating focuses on generating semantic descriptions to summarize visual content of ima...
© 1999-2012 IEEE. Recent progress in using long short-term memory (LSTM) for image captioning has mo...
Recent progress in using Long Short-Term Memory (LSTM) for image description has motivated the explo...
Large language models such as GPT-3 have demonstrated an impressive capability to adapt to new tasks...
This paper proposes to improve visual question answering (VQA) with structured representations of bo...
Video captioning refers to the task of generating a natural language sentence that explains the cont...