This paper introduces a new problem, called Visual Text Correction (VTC), i.e., finding and replacing an inaccurate word in the textual description of a video. We propose a deep network that can simultaneously detect an inaccuracy in a sentence, and fix it by replacing the inaccurate word(s). Our method leverages the semantic interdependence of videos and words, as well as the short-term and long-term relations of the words in a sentence. Our proposed formulation can solve the VTC problem employing an End-to-End network in two steps: (1) Inaccuracy detection, and (2) correct word prediction. In detection step, each word of a sentence is reconstructed such that the reconstruction for the inaccurate word is maximized. We exploit both Short Te...
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine le...
As an alternative approach, viseme-based lipreading systems have demonstrated promising performance ...
Video captioning has become a broad and interesting research area. Attention-based encoder-decoder m...
The problem of video captioning has been heavily investigated from the research community the last y...
The Bag-of-visual-Words (BovW) model is one of the most popular visual content representation method...
Given a video and a description sentence with one missing word, \u27source sentence\u27, Video-Fill-...
This paper integrates techniques in natural language processing and computer vision to improve recog...
Data augmentation is one of the ways to deal with labeled data scarcity and overfitting. Both of the...
This paper strives to find amidst a set of sentences the one best describing the content of a given ...
This paper is concerned with the study of scene text detection and recognition from blurry natural v...
This paper is concerned with the study of scene text detection and recognition from blurry natural v...
Generating natural language descriptions for visual data links computer vision and computational lin...
As an alternative approach, viseme-based lipreading systems have demonstrated promising performance ...
Current multimodal data processing methods use deep learning to combine complementary visual and tex...
Recently, joint video-language modeling has been attracting more and more attention. However, most e...
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine le...
As an alternative approach, viseme-based lipreading systems have demonstrated promising performance ...
Video captioning has become a broad and interesting research area. Attention-based encoder-decoder m...
The problem of video captioning has been heavily investigated from the research community the last y...
The Bag-of-visual-Words (BovW) model is one of the most popular visual content representation method...
Given a video and a description sentence with one missing word, \u27source sentence\u27, Video-Fill-...
This paper integrates techniques in natural language processing and computer vision to improve recog...
Data augmentation is one of the ways to deal with labeled data scarcity and overfitting. Both of the...
This paper strives to find amidst a set of sentences the one best describing the content of a given ...
This paper is concerned with the study of scene text detection and recognition from blurry natural v...
This paper is concerned with the study of scene text detection and recognition from blurry natural v...
Generating natural language descriptions for visual data links computer vision and computational lin...
As an alternative approach, viseme-based lipreading systems have demonstrated promising performance ...
Current multimodal data processing methods use deep learning to combine complementary visual and tex...
Recently, joint video-language modeling has been attracting more and more attention. However, most e...
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine le...
As an alternative approach, viseme-based lipreading systems have demonstrated promising performance ...
Video captioning has become a broad and interesting research area. Attention-based encoder-decoder m...