Audiovisual (AV) representation learning is an important step toward designing machines that can understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. Specifically, we develop methods that identify events and localize the corresponding AV cues in unconstrained videos. Importantly, this is done using weak labels: only video-level event labels are known, with no information about where the events occur in time. We show that the learnt representations are useful for several tasks, such as event/object classification, audio event detection, audio source separation and visual object localization. An important feature of...
In this paper, we investigate how to learn rich and robust feature representations for audio classif...
In the acoustic scene classification (ASC) task, an acoustic scene consists of diverse sounds and is...
Audio tagging aims to perform multi-label classification on audio chunks and it is a newly proposed ...
We investigate the weakly-supervised audio-visual video parsing task, which aims to parse a video in...
This paper for the first time explores audio-visual event localization in an unsupervised manner. Pr...
Localizing the audio-visual events in video requires a combined judgment of visual and audio compone...
The goal of this thesis is to design algorithms that enable robust detection of objects and events in...
Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn ...
This HDR manuscript summarizes our work concerning the applications of machine learning techniques t...
In this paper, we address the detection of audio events in domestic environmen...
The design of new methods and models when only weakly-labeled data are availab...
Audio-visual event detection aims to identify semantically defined events that reveal human activiti...
This work aims to temporally localize events that are both audible and visible in video. Previous me...
In this paper, we present a gated convolutional neural network and a temporal attention-based locali...
Self-supervised representation learning can mitigate the limitations in recognition tasks with few m...