The problem of video captioning has been heavily investigated from the research community the last years and, especially, since Recurrent Neural Networks (RNNs) have been introduced. Aforementioned approaches of video captioning, are usually based on sequence-to-sequence models that aim to exploit the visual information by detecting events, objects, or via matching entities to words. However, the exploitation of the contextual information that can be extracted from the vocabulary has not been investigated yet, except from approaches that make use of parts of speech such as verbs, nouns, and adjectives. The proposed approach is based on the assumption that textually similar captions should represent similar visual content. Specifically, we p...
Recent research in Deep Learning has sent the quality of results in multimedia tasks rocketing: than...
This is a high level computer vision paper, which employs concepts from Natural Language Understandi...
Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with ...
Video captioning refers to the task of generating a natural language sentence that explains the cont...
This paper strives to find amidst a set of sentences the one best describing the content of a given ...
Video captioning, i.e., the task of generating captions from video sequences creates a bridge betwee...
This paper presents a novel approach for automatically generating image descriptions: visual detecto...
Automatic video description, or video captioning, is a challenging yet much attractive task. It aims...
We address the problem of text-based activity retrieval in video. Given a sentence describing an act...
This paper introduces a new problem, called Visual Text Correction (VTC), i.e., finding and replacin...
The dominant paradigm for learning video-text representations – noise contrastive learning – increas...
Recent work has shown that the integration of visual information into text-based models can substant...
Video captioning has become a broad and interesting research area. Attention-based encoder-decoder m...
Transformer-based models are widely adopted in multi-modal learning as the cross-attention mechanism...
Abstract Dense video captioning (DVC) detects multiple events in an input video and generates natura...
Recent research in Deep Learning has sent the quality of results in multimedia tasks rocketing: than...
This is a high level computer vision paper, which employs concepts from Natural Language Understandi...
Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with ...
Video captioning refers to the task of generating a natural language sentence that explains the cont...
This paper strives to find amidst a set of sentences the one best describing the content of a given ...
Video captioning, i.e., the task of generating captions from video sequences creates a bridge betwee...
This paper presents a novel approach for automatically generating image descriptions: visual detecto...
Automatic video description, or video captioning, is a challenging yet much attractive task. It aims...
We address the problem of text-based activity retrieval in video. Given a sentence describing an act...
This paper introduces a new problem, called Visual Text Correction (VTC), i.e., finding and replacin...
The dominant paradigm for learning video-text representations – noise contrastive learning – increas...
Recent work has shown that the integration of visual information into text-based models can substant...
Video captioning has become a broad and interesting research area. Attention-based encoder-decoder m...
Transformer-based models are widely adopted in multi-modal learning as the cross-attention mechanism...
Abstract Dense video captioning (DVC) detects multiple events in an input video and generates natura...
Recent research in Deep Learning has sent the quality of results in multimedia tasks rocketing: than...
This is a high level computer vision paper, which employs concepts from Natural Language Understandi...
Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with ...