Project page: https://rohitgirdhar.github.io/ActionVLAD/

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representa...
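The core of the representation described above is a NetVLAD-style layer [6] that soft-assigns every local convolutional descriptor, from every frame and spatial location, to a set of anchor points and accumulates the residuals. Below is a minimal NumPy sketch of that joint space-time aggregation; the function names, the fixed (rather than learned) anchors, and the 64-cluster / 512-dimensional sizes are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def soft_assign(descriptors, anchors, alpha=1.0):
    """Softmax over negative squared distances to K anchor points.

    descriptors: (N, D) local conv features from all frames and locations
    anchors:     (K, D) anchor points (learnable in the real model; fixed here)
    returns:     (N, K) soft-assignment weights that sum to 1 per descriptor
    """
    d2 = ((descriptors[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def vlad_pool_video(frame_maps, anchors, alpha=1.0):
    """Pool conv feature maps of a whole video into one (K*D,) descriptor.

    frame_maps: (T, H, W, D) features for T frames, pooled jointly over
                space and time, as finding (i) in the abstract suggests.
    """
    T, H, W, D = frame_maps.shape
    x = frame_maps.reshape(T * H * W, D)
    a = soft_assign(x, anchors, alpha)                  # (N, K)
    residuals = x[:, None, :] - anchors[None, :, :]     # (N, K, D)
    vlad = (a[:, :, None] * residuals).sum(axis=0)      # (K, D) accumulated residuals
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12  # intra-normalization
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)              # global L2 normalization

# Toy usage: 8 frames with 7x7x512 conv maps, 64 anchors.
rng = np.random.default_rng(0)
maps = rng.standard_normal((8, 7, 7, 512)).astype(np.float32)
anchors = rng.standard_normal((64, 512)).astype(np.float32)
video_descriptor = vlad_pool_video(maps, anchors)
print(video_descriptor.shape)  # (32768,) == K * D
```

In the full model, the anchors and the preceding two-stream CNN would be trained end-to-end with a classification loss, and, per finding (ii) of the abstract, the appearance and motion streams would each be aggregated into their own descriptor before the two are combined.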
As the success of deep models has led to their deployment in all areas of computer vision, it is inc...
We propose an effective approach for action localization, both in the spatial ...
Most video-based action recognition approaches create the video-level representation by temporally p...
Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not ye...
Spatiotemporal and motion feature representations are the key to video action recognition. Typical p...
This thesis focuses on video understanding for human action and interaction recognition. We start by...
Classification of human actions from real-world video data is one of the most important topics in co...
Action recognition methods enable several intelligent machines to recognize human action in their da...
We propose a function-based temporal pooling method that captures the latent structure of the video ...
Fernando B., Gavves E., Oramas Mogrovejo J., Ghodrati A., Tuytelaars T., "Modeling video evolution ...
This paper introduces a fusion convolutional architecture for efficient learning of spatio-temporal ...
In this paper, several variants of two-stream architectures for temporal action proposal generation ...