This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system ...
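To make the pretext labels concrete, here is a minimal sketch of how such spatio-temporal summaries could be derived from raw frames. It is an assumption-laden illustration, not the paper's exact formulation: frame differences stand in for motion magnitude, a uniform grid stands in for the spatial partitioning pattern, and the function name largest_motion_statistics and its grid parameter are hypothetical.

```python
import numpy as np

def largest_motion_statistics(frames, grid=(4, 4)):
    """Sketch of pretext-label extraction from an unlabeled clip.

    frames: uint8 array of shape (T, H, W, 3).
    Returns (a) the grid cell with the largest accumulated motion and
    (b) the grid cell with the largest temporal color diversity,
    each encoded as a rough spatial-location class index.
    """
    frames = frames.astype(np.float32)

    # Frame differences as a cheap stand-in for optical-flow magnitude.
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=-1)   # (T-1, H, W)
    motion = diffs.sum(axis=0)                              # (H, W)

    _, H, W, _ = frames.shape
    gh, gw = grid
    ch, cw = H - H % gh, W - W % gw                         # crop to a multiple of the grid

    # Block-sum the motion map over the grid cells; the argmax is the
    # rough spatial location of the largest motion.
    motion_cells = motion[:ch, :cw].reshape(gh, ch // gh, gw, cw // gw).sum(axis=(1, 3))
    loc_label = int(motion_cells.argmax())                  # class in [0, gh * gw)

    # Temporal color diversity per pixel, block-summed the same way.
    std_map = frames.std(axis=0).mean(axis=-1)              # (H, W)
    div_cells = std_map[:ch, :cw].reshape(gh, ch // gh, gw, cw // gw).sum(axis=(1, 3))
    div_label = int(div_cells.argmax())

    return loc_label, div_label

# Usage: pseudo-labels for a random 16-frame clip on a 4x4 partition.
clip = np.random.randint(0, 256, size=(16, 112, 112, 3), dtype=np.uint8)
loc, div = largest_motion_statistics(clip, grid=(4, 4))     # both in [0, 16)
```

A network trained on the clip would then regress or classify these summaries, so no manual annotation is needed; the full method additionally predicts quantities such as the dominant motion direction and dominant colors, which are omitted here for brevity.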
The quality of the image representations obtained from self-supervised learning depends strongly on ...
We present a novel hierarchical and distributed model for learning invariant spatio-temporal feature...
This thesis presents a novel self-supervised approach of learning visual representations from videos...
Self-supervised tasks such as colorization, inpainting and jigsaw puzzle have been utilized for visu...
We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich...
The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, su...
We pose video colorization as a self-supervised learning problem for visual tracking. We use large a...
Summarization: We present a technique for the summarization and spatiotemporal scaling of video cont...
Spatio-temporal convolution often fails to learn motion dynamics in videos and thus an effective mot...
This paper introduces an unsupervised framework to extract semantically rich features for video rep...
This paper describes a method for building visual scene models from video data using quantized descr...
In most of the existing work for activity recognition, 3D ConvNets show promising performance for le...
Automated analysis of videos for content understanding is one of the most challenging and well resea...