We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images, which capture static scene appearance, videos also contain sound and temporal scene dynamics. To leverage the temporal and aural dimensions inherent to videos, our method extends temporal self-supervision to the audio-visual setting and integrates it with multi-modal contrastive objectives. As temporal self-supervision tasks, we pose playback speed and direction recognition in both modalities and propose intra- and inter-modal temporal ordering tasks. Furthermore, we design a novel contrastive objective in which the usual pairs are supplemented with additio...
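To make the playback speed and direction recognition tasks mentioned above more concrete, the following is a minimal sketch of how one training sample for such a pretext task might be built. It is an illustration under stated assumptions, not the authors' implementation: the function name `make_temporal_pretext_sample`, the speed set `SPEEDS`, and the uniform sampling of labels are all hypothetical choices. The input `frames` can be any indexable sequence, for example video frames or the time steps of an audio spectrogram.

```python
import random

# Hypothetical set of playback-speed classes (frame-skip factors); an assumption.
SPEEDS = (1, 2, 4, 8)


def make_temporal_pretext_sample(frames, clip_len, rng=random):
    """Return (clip, speed_label, reversed_label) sampled from a sequence.

    The clip is built by taking every `step`-th element (simulating faster
    playback) and optionally reversing it (simulating backward playback).
    A model would then be trained to predict both labels from the clip.
    """
    speed_label = rng.randrange(len(SPEEDS))
    step = SPEEDS[speed_label]

    # Number of source elements the clip spans at this speed.
    span = (clip_len - 1) * step + 1
    if span > len(frames):
        raise ValueError("sequence too short for the chosen speed")

    start = rng.randrange(len(frames) - span + 1)
    clip = frames[start:start + span:step]

    reversed_label = rng.randrange(2)  # 0 = forward, 1 = backward
    if reversed_label:
        clip = clip[::-1]

    return clip, speed_label, reversed_label
```

Because the labels are derived from the sampling procedure itself, no human annotation is needed; the same routine could be applied to both modalities of a video in this sketch.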
This paper explores, for the first time, audio-visual event localization in an unsupervised manner. Pr...
We present CrissCross, a self-supervised framework for learning audio-visual representations. A nove...
In low-level video analyses, effective representations are important to derive the correspondences b...
Imagine the sound of waves. This sound may evoke the memories of days at the beach. A single sound s...
The remarkable success of deep learning in various domains relies on the availability of large-scale...
In this paper, we propose a self-supervised method for video representation le...
In the image domain, excellent representations can be learned by inducing invariance to content-pres...
We present ConCur, a contrastive video representation learning method that uses curriculum learning ...
This thesis presents a novel self-supervised approach to learning visual representations from videos...
Self-supervised video representation learning aimed at maximizing similarity between different tempo...
The objective of this paper is visual-only self-supervised video representation learning. We make th...
In recent research, self-supervised video representation learning methods have achieved improve...
A steady momentum of innovations and breakthroughs has convincingly pushed the limits of unsupervise...
There is a natural correlation between the visual and auditive elements of a video. In this work, we...
With the rapid advancement of deep learning techniques in computer vision, researchers have achieved...