We present an architecture for learning semantic multi-modal video representations to learn semantic representations of videos from unlabeled data with transformer architectures. While multi-modal transformer architectures have been shown to increase accuracy of video classification and feature learning tasks, these techniques do not incorporate semantic information. Our Semantic Contrastive Multi-Modal Video Transformer (SCMMVT) takes raw video, audio, and text data and generates semantic multi-modal representations that represent connections and relations between portions of the video. We integrate multiple pre-trained architectures and evaluate feature extraction performance with video action recognition downstream tasks
Abstract Video data are usually represented by high dimensional features. The performance of video s...
In this paper, we propose a novel Spatio-Temporal Video Retrieval model to extract spatio-temporal a...
Description of human activities in videos results not only in detection of actions and objects but a...
We present an architecture for learning semantic multi-modal video representations to learn semantic...
International audienceThe task of retrieving video content relevant to natural language queries play...
Research Doctorate - Doctor of Philosophy (PhD)This thesis investigates the problem of seeking multi...
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine le...
The automatic analysis and indexing of multimedia content in general domains are im-portant for a va...
Accurate and efficient video classification demands the fusion of multimodal information and the use...
Active learning has been demonstrated to be a useful tool to reduce human labeling effort for many m...
This paper gives an overview of approaches to video representation targeting semantic analysis for c...
This paper surveys the approaches to video representation, focusing on semantic analysis for content...
Digital video now plays an important role in medical education and healthcare, but our ability to au...
One of the major research topics in computer vision is automatic video scene understanding where the...
This thesis presents a novel approach to video understanding by emulating human perceptual processes...
Abstract Video data are usually represented by high dimensional features. The performance of video s...
In this paper, we propose a novel Spatio-Temporal Video Retrieval model to extract spatio-temporal a...
Description of human activities in videos results not only in detection of actions and objects but a...
We present an architecture for learning semantic multi-modal video representations to learn semantic...
International audienceThe task of retrieving video content relevant to natural language queries play...
Research Doctorate - Doctor of Philosophy (PhD)This thesis investigates the problem of seeking multi...
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine le...
The automatic analysis and indexing of multimedia content in general domains are im-portant for a va...
Accurate and efficient video classification demands the fusion of multimodal information and the use...
Active learning has been demonstrated to be a useful tool to reduce human labeling effort for many m...
This paper gives an overview of approaches to video representation targeting semantic analysis for c...
This paper surveys the approaches to video representation, focusing on semantic analysis for content...
Digital video now plays an important role in medical education and healthcare, but our ability to au...
One of the major research topics in computer vision is automatic video scene understanding where the...
This thesis presents a novel approach to video understanding by emulating human perceptual processes...
Abstract Video data are usually represented by high dimensional features. The performance of video s...
In this paper, we propose a novel Spatio-Temporal Video Retrieval model to extract spatio-temporal a...
Description of human activities in videos results not only in detection of actions and objects but a...