We address the task of supervised action segmentation, which aims to partition a video into non-overlapping segments, each representing a different action. Recent works apply transformers to perform temporal modeling at the frame level, but they suffer from high computational cost and do not capture action dependencies over long temporal horizons well. To address these issues, we propose an efficient BI-level Temporal modeling (BIT) framework that learns explicit action tokens to represent action segments and performs temporal modeling at the frame and action levels in parallel, while maintaining a low computational cost. Our model contains (i) a frame branch that uses convolution to learn frame-level relationships, (ii) an action branch that uses tra...
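To make the task definition concrete, the following minimal sketch collapses a sequence of per-frame action labels into non-overlapping segments. It illustrates only the output format of action segmentation, not the BIT model itself; the function name `frames_to_segments` and the example labels are hypothetical and chosen for illustration.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (label, start, end) segments.

    `start` is inclusive and `end` is exclusive, so consecutive segments
    never overlap and together cover every frame exactly once.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical per-frame predictions for a six-frame video.
labels = ["pour", "pour", "stir", "stir", "stir", "pour"]
segments = frames_to_segments(labels)
print(segments)
```

Note that the same action label may appear in more than one segment when the action recurs later in the video, as "pour" does here.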
The tremendous growth in video data, both on the internet and in real life, has encouraged the devel...
Existing Temporal Action Detection (TAD) methods typically take a pre-processing step in converting ...
Automated methods for analyzing human activities from video or sensor data are critical for enabling...
Action classification has made great progress, but segmenting and recognizing actions from long untr...
In this dissertation, I present my work towards exploring temporal information for better video unde...
Recent temporal action segmentation approaches need frame annotations during training to be effectiv...
In temporal action localization, given an input video, the goal is to predict which actions it conta...
Understanding human actions in videos is of great interest in various scenarios ranging from surveil...
Temporal segmentation of events is an essential task and a precursor for the automatic recognition o...
In this paper, we propose Hierarchical Action Segmentation Refiner (HASR), which can refine temporal...
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categor...
In this report, we present the ReLER@ZJU submission to the Ego4D Moment Queries Challenge in ECCV 2...
We present a novel approach for unsupervised activity segmentation which uses video frame clustering...
In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cook...