Localizing audio-visual events in video requires a joint judgment of the visual and audio components. To integrate multimodal information, existing methods model cross-modal relationships by feeding unimodal features into attention modules. However, these unimodal features are encoded in separate spaces, leaving a large heterogeneity gap between modalities. Moreover, existing attention modules ignore the temporal asynchrony between vision and hearing when constructing cross-modal connections, which can lead to one modality misinterpreting another. This paper therefore aims to improve event localization performance by addressing these two problems and proposes a framework that feeds audio and visual featu...
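To make the attention-based fusion discussed above concrete, here is a minimal sketch of the kind of cross-modal attention module these methods describe: one modality's features act as queries over the other modality's features to build cross-modal connections. This is an illustrative stand-in under assumed dimensions, not the paper's actual architecture; the class name, feature size, and head count are all hypothetical.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of cross-modal attention: the target modality queries the source.

    Illustrative only; dimensions and structure are assumptions, not the
    paper's proposed design.
    """

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Queries come from the target modality; keys/values from the source.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, source):
        # target: (B, T_target, dim), source: (B, T_source, dim)
        attended, _ = self.attn(query=target, key=source, value=source)
        # Residual connection keeps the unimodal signal alongside the
        # cross-modal evidence before normalization.
        return self.norm(target + attended)

# Example: ten one-second segments of audio and visual features per video.
audio = torch.randn(2, 10, 256)   # (batch, time, feature)
visual = torch.randn(2, 10, 256)

fuse = CrossModalAttention()
audio_guided_visual = fuse(visual, audio)  # visual attends to audio
visual_guided_audio = fuse(audio, visual)  # audio attends to visual
print(audio_guided_visual.shape)  # torch.Size([2, 10, 256])
```

Note that this vanilla formulation exhibits exactly the two limitations the abstract points out: the audio and visual inputs come from separately encoded spaces, and attention is computed over strictly aligned time steps with no mechanism for temporal asynchrony between the streams.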
Audio-visual event detection aims to identify semantically defined events that reveal human activiti...
Visual attention modeling is a very active research field and several image an...
Our experience of the world is multimodal - we see objects, hear sounds, and read texts to perceive ...
This work aims to temporally localize events that are both audible and visible in video. Previous me...
In human multi-modality perception systems, the benefits of integrating auditory and visual informat...
This paper is the first to explore audio-visual event localization in an unsupervised manner. Pr...
We investigate the weakly-supervised audio-visual video parsing task, which aims to parse a video in...
Audiovisual (AV) representation learning is an important task from the perspec...
Explaining the decision of a multi-modal decision-maker requires determining the evidence from both...
Temporal action localization aims at localizing action instances from untrimmed videos. Existing wor...
Exploiting the multimodal and temporal interaction between audio-visual channels is essential for au...
Models based on diverse attention mechanisms have recently shone in tasks related to acoustic event...
Emotions play a crucial role in human-to-human communication and have a complex socio-psychological nature. ...
People can easily imagine the potential sound while seeing an event. This natural synchronization be...
Visual events are usually accompanied by sounds in our daily lives. However, can machines learn ...