Masked autoencoders (MAEs) have recently emerged as state-of-the-art self-supervised spatiotemporal representation learners. Inheriting from their image counterparts, however, existing video MAEs still focus largely on static appearance learning and remain limited in learning dynamic temporal information, making them less effective for video downstream tasks. To resolve this drawback, in this work we present a motion-aware variant, MotionMAE. Apart from learning to reconstruct individual masked patches of video frames, our model is designed to additionally predict the corresponding motion structure information over time. This motion information is obtained from the temporal difference of nearby frames. As a result, our model can effectively extract both static ...
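As a rough illustration of the motion signal mentioned in the abstract, the temporal difference of nearby frames can be computed by subtracting each frame from a later one. The sketch below is a minimal NumPy version under stated assumptions: the function name `temporal_difference`, the `stride` parameter, and the `(T, H, W, C)` layout are hypothetical choices for illustration, not the authors' implementation, and the abstract does not specify how the difference is normalized or which frames are paired.

```python
import numpy as np


def temporal_difference(frames: np.ndarray, stride: int = 1) -> np.ndarray:
    """Return frames[t + stride] - frames[t] for every valid t.

    Assumptions (not from the paper): `frames` has shape (T, H, W, C),
    and `stride` is a hypothetical gap between the paired frames.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    if frames.shape[0] <= stride:
        raise ValueError("need more than `stride` frames")
    # Cast first so that uint8 pixel values do not wrap around on subtraction.
    frames = frames.astype(np.float32)
    return frames[stride:] - frames[:-stride]


# A toy clip of 4 single-pixel frames with brightness 0, 50, 100, 150.
clip = (np.arange(4, dtype=np.uint8) * 50).reshape(4, 1, 1, 1)
motion = temporal_difference(clip)  # shape (3, 1, 1, 1)
```

A static clip yields an all-zero result, so a target of this kind carries only what changes between frames, while the pixel reconstruction target carries appearance.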
Self-supervised feature learning from video. Understanding the inner workings of deep learning algori...
The quality of the image representations obtained from self-supervised learning depends strongly on ...
Can we leverage the audiovisual information already present in video to improve self-supervised repr...
Several recent works have directly extended the image masked autoencoder (MAE) with random masking i...
Self-supervised Video Representation Learning (VRL) aims to learn transferable representations from...
Pre-training video transformers on extra large-scale datasets is generally required to achieve premi...
As the most essential property in a video, motion information is critical to a robust and generalize...
Due to the remarkable progress of deep generative models, animating images has...
Static appearance of video may impede the ability of a deep neural network to learn motion-relevant ...
We describe a new spatio-temporal video autoencoder, based on a classic spatial image autoencoder an...
Dynamic facial expression recognition (DFER) is essential to the development of intelligent and empa...
Static image action recognition, which aims to recognize action based on a single image, usually rel...
Masked image modeling has been demonstrated as a powerful pretext task for generating robust represe...
The tremendous growth in video data, both on the internet and in real life, has encouraged the devel...
In this dissertation, I present my work towards exploring temporal information for better video unde...