Common video representations often deploy an average or maximum pooling of pre-extracted frame features over time. Such an approach provides a simple means to encode feature distributions, but is likely to be suboptimal. As an alternative, we here explore combinations of learnable pooling techniques such as Soft Bag-of-words, Fisher Vectors, NetVLAD, GRU and LSTM to aggregate video features over time. We also introduce a learnable non-linear network unit, named Context Gating, aiming at modeling interdependencies between features. We evaluate the method on the multi-modal Youtube-8M Large-Scale Video Understanding dataset using pre-extracted visual and audio features. We demonstrate improvements provided by the Context Gating as well as by the combination of learnable pooling methods.
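The abstract does not spell out how Context Gating is computed; a minimal sketch of one plausible form of such a unit is an elementwise sigmoid gate derived from the features and multiplied back onto them. The class name, layer choice and dimensions below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ContextGating(nn.Module):
    """Sketch of a Context Gating-style unit: a learned, per-dimension
    sigmoid gate lets feature dimensions re-weight one another."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, dim)  # produces one gate value per feature dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.fc(x))  # gates in (0, 1), conditioned on the whole feature vector
        return x * gate                   # gated (re-weighted) features

# Example: gate a batch of pooled 1024-d video descriptors.
pooled = torch.randn(8, 1024)
gated = ContextGating(1024)(pooled)
```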
One of the major research topics in computer vision is automatic video scene understanding where the...
Complex video analysis is a challenging problem due to the long and sophisticated temporal structure...
Typical video classification methods often divide a video into short clips, do inference on each cli...
Current state-of-the art object detection and recognition algorithms mainly use supervised training,...
We propose a function-based temporal pooling method that captures the latent structure of the video ...
This thesis compares hand-designed features with features learned by feature learning methods in vid...
Abstract. Real-world videos often contain dynamic backgrounds and evolving people activities, especi...
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine le...
This paper introduces an unsupervised framework to extract semantically rich features for video rep...
University of Technology Sydney. Faculty of Engineering and Information Technology. Video understandi...
The quality of the image representations obtained from self-supervised learning depends strongly on ...