Long-Term activities involve humans performing complex, minutes-long actions. Differently than in traditional action recognition, complex activities are normally composed of a set of sub-actions, that can appear in different order, duration, and quantity. These aspects introduce a large intra-class variability, that can be hard to model. Our approach aims to adaptively capture and learn the importance of spatial and temporal video regions for minutes-long activity classification. Inspired by previous work on Region Attention, our architecture embeds the spatio-temporal features from multiple video regions into a compact fixed-length representation. These features are extracted with a 3D convolutional backbone specially fine-tuned. Additiona...
In this paper, we newly introduce the concept of temporal attention filters, and describe how they c...
We generate massive amounts of video data every day. While most real-world videos are long and untri...
Most recent approaches for action recognition from video leverage deep architectures to encode the v...
Long-Term activities involve humans performing complex, minutes-long actions. Differently than in tr...
This thesis focuses on video understanding for human action and interaction recognition. We start by...
Automated analysis of videos for content understanding is one of the most challenging and well resea...
This thesis contributes to the literature of understanding and recognizing human activities in video...
Human action recognition plays a crucial role in various applications, including video surveillance,...
In most of the existing work for activity recognition, 3D ConvNets show promising performance for le...
We introduce a simple yet effective network that embeds a novel Discriminative Feature Pooling (DFP)...
In recent years, deep learning techniques have excelled in video action recognition. However, curren...
Local spatio-temporal features have been shown to be effective and robust in order to represent simp...
Human activity recognition in videos with convolutional neural network (CNN) features has received i...
2014-09-22Human action recognition in videos is a central problem of computer vision, with numerous ...
Spatiotemporal and motion feature representations are the key to video action recognition. Typical p...
In this paper, we newly introduce the concept of temporal attention filters, and describe how they c...
We generate massive amounts of video data every day. While most real-world videos are long and untri...
Most recent approaches for action recognition from video leverage deep architectures to encode the v...
Long-Term activities involve humans performing complex, minutes-long actions. Differently than in tr...
This thesis focuses on video understanding for human action and interaction recognition. We start by...
Automated analysis of videos for content understanding is one of the most challenging and well resea...
This thesis contributes to the literature of understanding and recognizing human activities in video...
Human action recognition plays a crucial role in various applications, including video surveillance,...
In most of the existing work for activity recognition, 3D ConvNets show promising performance for le...
We introduce a simple yet effective network that embeds a novel Discriminative Feature Pooling (DFP)...
In recent years, deep learning techniques have excelled in video action recognition. However, curren...
Local spatio-temporal features have been shown to be effective and robust in order to represent simp...
Human activity recognition in videos with convolutional neural network (CNN) features has received i...
2014-09-22Human action recognition in videos is a central problem of computer vision, with numerous ...
Spatiotemporal and motion feature representations are the key to video action recognition. Typical p...
In this paper, we newly introduce the concept of temporal attention filters, and describe how they c...
We generate massive amounts of video data every day. While most real-world videos are long and untri...
Most recent approaches for action recognition from video leverage deep architectures to encode the v...