Sequence-to-sequence models with attention mechanisms have shown promising improvements on video captioning. While there is rich information both within and between frames, spatial attention is rarely explored, and motion information is usually handled by 3D-CNNs as just another modality for fusion. On the other hand, research on human perception suggests that apparent motion can attract attention. Motivated by this, we aim to learn spatial attention on video frames under the guidance of motion information for caption generation. We present a novel video captioning framework built on Motion Guided Spatial Attention (MGSA). The proposed MGSA exploits the motion between video frames by learning spatial attention from stacked...
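The core mechanism described above, learning a spatial attention map from stacked optical flow and using it to pool the appearance features of a frame, can be sketched roughly as follows. This is a minimal illustration, assuming PyTorch; the module name MotionGuidedSpatialAttention, the layer sizes, and the flow-stack depth are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionGuidedSpatialAttention(nn.Module):
    """Minimal sketch of motion-guided spatial attention.

    A small CNN maps stacked optical flow (2 channels per flow field,
    K fields stacked) to a single-channel attention logit map, which is
    normalized over spatial positions and used to pool frame features.
    """

    def __init__(self, feat_channels=512, flow_stack=10, hidden=64):
        super().__init__()
        # Hypothetical flow sub-network; depth and widths are assumptions.
        self.flow_net = nn.Sequential(
            nn.Conv2d(2 * flow_stack, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, frame_feats, flow):
        # frame_feats: (B, C, H, W) appearance features from a 2D CNN
        # flow:        (B, 2*K, H, W) stacked optical flow fields
        logits = self.flow_net(flow)                    # (B, 1, H, W)
        attn = F.softmax(logits.flatten(2), dim=-1)     # normalize over H*W
        attn = attn.view_as(logits)                     # (B, 1, H, W)
        # Attention-weighted pooling of appearance features.
        pooled = (frame_feats * attn).sum(dim=(2, 3))   # (B, C)
        return pooled, attn
```

In a full captioning pipeline, the pooled vector would feed the caption decoder (e.g., an LSTM) at each step; whether the attention map is recomputed per frame or per decoding step is a design choice not fixed by this sketch.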
An adaptive spatiotemporal saliency algorithm for video attention detection using motion vector deci...
Motion, as the defining property of video, has been critical to the development of video understanding mo...
Previous models for video captioning often use the output from a specific layer of a Convolutional N...
The human vision system actively seeks interesting regions in images to reduce the search effort in task...
Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A ...
This paper presents a new model of human attention that allows salient areas to be extracte...
Video understanding has become increasingly important as surveillance, social, and informational vid...
Human vision has the natural cognitive ability to focus on salient objects or areas when watching st...
The human vision system actively seeks salient regions and movements in video sequences to reduce the se...
We propose "Areas of Attention", a novel attention-based model for automatic...
This paper presents a new model of human attention that allows salient areas t...
In recent years, deep learning techniques have excelled in video action recognition. However, curren...
Convolutional neural networks have achieved excellent success for object recognition in still imag...
Automated analysis of videos for content understanding is one of the most challenging and well resea...
Video captioning has been attracting broad research attention in the multimedia communi...