This work explores how self-supervised learning on videos can be used to learn a class-specific image embedding that encodes pose and shape information in the form of landmarks. At training time, two frames are extracted from the same video of an object class (e.g. human upper body) and each is encoded to an embedding. Conditioned on these embeddings, a decoder network is tasked with transforming one frame into the other. To successfully perform long-range transformations (e.g. a wrist lowered in one image should be mapped to the same wrist raised in another), we introduce a new hierarchical probabilistic network decoder model. Once trained, the embedding can be used for a variety of downstream tasks and domains. We demonstrate our approach quantitatively on...
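The data flow of the training setup described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names (`sample_frame_pair`, `encode`, `decode`, `training_step`), the toy encoder and decoder, and the L1 reconstruction loss are hypothetical stand-ins, not the authors' networks or objective. In particular, the hierarchical probabilistic decoder is not modelled; frames are flat lists of pixel values.

```python
import random


def sample_frame_pair(num_frames, rng):
    """Pick two distinct frame indices (source, target) from one video."""
    if num_frames < 2:
        raise ValueError("a video needs at least two frames")
    src_idx, tgt_idx = rng.sample(range(num_frames), 2)
    return src_idx, tgt_idx


def encode(frame):
    """Toy stand-in encoder (assumption): embedding = (mean, max) of pixels."""
    return (sum(frame) / len(frame), max(frame))


def decode(source_frame, src_emb, tgt_emb):
    """Toy stand-in decoder (assumption): conditioned on both embeddings,
    shift the source frame by the difference of the embedding means."""
    shift = tgt_emb[0] - src_emb[0]
    return [pixel + shift for pixel in source_frame]


def l1_loss(pred, target):
    """Mean absolute error between predicted and target frame (assumed loss)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def training_step(video, rng):
    """One self-supervised step: no labels, only two frames of one video."""
    src_idx, tgt_idx = sample_frame_pair(len(video), rng)
    source, target = video[src_idx], video[tgt_idx]
    prediction = decode(source, encode(source), encode(target))
    return l1_loss(prediction, target)


if __name__ == "__main__":
    # Hypothetical "video": each frame is the previous one shifted by 1.0.
    video = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
    print(training_step(video, random.Random(0)))
```

In a real system the encoder and decoder would be learned networks and the loss would be minimised by gradient descent; the sketch only shows that the supervision signal comes from the second frame of the same video rather than from manual annotation.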
This thesis tackles the challenge of learning the abstract structure of object categories without ma...
Knowledge distillation, which is a process of transferring complex knowledge learned by a heavy netw...
The objective of this paper is visual-only self-supervised video representation learning. We make th...
We propose a self-supervised framework for learning facial attributes by simply watching videos of a...
We propose a new method for recognizing the pose of objects from a single image that for learning us...
Action recognition based on human pose has witnessed increasing attention due ...
This thesis explores how a computer can learn the structure of visual objects in the absence of stro...
Self-supervised learning in video involves learning representations without using high-cost labels, ...
The advent of deep learning has brought about great progress on many fundamental computer vision t...
With the rapid advancement of deep learning techniques in computer vision, researchers have achieved...
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervi...
This paper proposes a novel pretext task to address the self-supervised video representation learnin...
We address the problem of recognizing the pose of an object category from video sequences capturing ...
The ability to localize and segment objects from unseen classes would open the...
We propose a semi-supervised self-training method for fine-tuning human pose estimations in videos t...