Transformer-based models are widely adopted in multi-modal learning, as the cross-attention mechanism has been shown to produce effective representations across modalities. The attention mechanism takes two modalities as the queries and keys, respectively, and maps their combination into the query domain. This thesis studies the use of the attention mechanism specifically for dense video captioning, the task of generating a paragraph describing the events in a video segment. When applying the attention mechanism to dense video captioning, the textual and visual contexts are normally taken as the queries and keys, respectively. Additionally, the vision-language contexts from the current segment and the history segments could...
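As an illustration of this query/key arrangement, the sketch below is a minimal, hypothetical example (not the model proposed in this thesis): it uses PyTorch's nn.MultiheadAttention with textual tokens as queries and visual features as keys and values, so the fused output stays in the textual (query) domain and keeps the length of the text sequence. All tensor shapes and sizes are assumptions chosen for illustration.

```python
# Minimal cross-attention sketch: text queries attend over video keys/values.
import torch
import torch.nn as nn

d_model = 512  # assumed shared embedding size for both modalities
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

# Hypothetical inputs: a batch of 2 segments, 20 text tokens and 100 video frames.
text = torch.randn(2, 20, d_model)    # textual context -> queries
video = torch.randn(2, 100, d_model)  # visual context  -> keys and values

# The fused representation has the same sequence length as the queries (text),
# i.e., the visual information is mapped into the query (textual) domain.
fused, attn_weights = cross_attn(query=text, key=video, value=video)
print(fused.shape)         # torch.Size([2, 20, 512])
print(attn_weights.shape)  # torch.Size([2, 20, 100]): attention of each token over frames
```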