Common video representations often deploy an average or maximum pooling of pre-extracted frame features over time. Such an approach provides a simple means to encode feature distributions, but is likely to be suboptimal. As an alternative, we here explore combinations of learnable pooling techniques such as Soft Bag-of-words, Fisher Vectors, NetVLAD, GRU and LSTM to aggregate video features over time. We also introduce a learnable non-linear network unit, named Context Gating, aiming at modeling interdependencies between features. We evaluate the method on the multi-modal Youtube-8M Large-Scale Video Understanding dataset using pre-extracted visual and audio features. We demonstrate improvements provided by the Context Gating as well as b...
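To make the baseline and the gating idea concrete, the following is a minimal sketch in plain Python. It shows average and maximum pooling of per-frame feature vectors over time, followed by a gating step. The gate is assumed here to take the form y = sigmoid(W x + b) * x (element-wise product), which is not spelled out in the text above; the function names, the toy frame values, and the zero-initialised weights are illustrative assumptions, not the authors' implementation.

```python
import math


def average_pool(frames):
    """Average each feature dimension over time (frames: list of equal-length vectors)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def max_pool(frames):
    """Take the maximum of each feature dimension over time."""
    return [max(column) for column in zip(*frames)]


def context_gating(x, weights, bias):
    """Assumed gating form: y = sigmoid(W x + b) * x, applied element-wise.

    weights is a square matrix (list of rows) and bias a vector, both of the
    same dimension as x. In a real model they would be learned.
    """
    gates = []
    for row, b in zip(weights, bias):
        pre_activation = sum(w * xi for w, xi in zip(row, x)) + b
        gates.append(1.0 / (1.0 + math.exp(-pre_activation)))
    return [g * xi for g, xi in zip(gates, x)]


# Toy input: three frames, two feature dimensions each (hypothetical values).
frames = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
avg = average_pool(frames)
mx = max_pool(frames)

# With zero weights and bias every gate equals 0.5, so the output is x / 2.
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
gated = context_gating(avg, zero_w, zero_b)
```

Because each gate depends on the whole input vector through W, one feature can suppress or pass another, which is one way to read "modeling interdependencies between features".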
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine le...
In this paper we evaluate three state-of-the-art neural-network-based approaches for large-sc...
The quality of the image representations obtained from self-supervised learning depends strongly on ...
Project page: https://rohitgirdhar.github.io/ActionVLAD/ In this work, we intro...
University of Technology Sydney. Faculty of Engineering and Information Technology. Video understandi...
Current state-of-the-art object detection and recognition algorithms mainly use supervised training,...
Deep neural networks have recently achieved competitive accuracy for human activity recognition. How...
Rank pooling is a temporal encoding method that summarizes the dynamics of a video sequence to a sin...
Current deep learning based video classification architectures are typically trained end-to-end on l...
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong ...
Technological innovation in the field of video action recognition drives the development of video-ba...
This paper makes two complementary contributions to event retrieval in large c...
We propose a function-based temporal pooling method that captures the latent structure of the video ...