Localizing audio-visual events in video requires a joint judgment of the visual and audio components. To integrate multimodal information, existing methods model cross-modal relationships by feeding unimodal features into attention modules. However, these unimodal features are encoded in separate spaces, leaving a large heterogeneity gap between modalities. Moreover, existing attention modules ignore the temporal asynchrony between vision and hearing when constructing cross-modal connections, which can lead to one modality misinterpreting another. This paper therefore aims to improve event localization performance by addressing these two problems and proposes a framework that feeds audio and visual featu...
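To make the attention-based fusion discussed above concrete, here is a minimal sketch of the kind of cross-modal attention module these methods describe: one modality's features act as queries over the other modality's features to build cross-modal connections. This is an illustrative stand-in under assumed dimensions, not the paper's actual architecture; the class name, feature size, and head count are all hypothetical.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of cross-modal attention: the target modality queries the source.

    Illustrative only; dimensions and structure are assumptions, not the
    paper's proposed design.
    """

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Queries come from the target modality; keys/values from the source.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, source):
        # target: (B, T_target, dim), source: (B, T_source, dim)
        attended, _ = self.attn(query=target, key=source, value=source)
        # Residual connection keeps the unimodal signal alongside the
        # cross-modal evidence before normalization.
        return self.norm(target + attended)

# Example: ten one-second segments of audio and visual features per video.
audio = torch.randn(2, 10, 256)   # (batch, time, feature)
visual = torch.randn(2, 10, 256)

fuse = CrossModalAttention()
audio_guided_visual = fuse(visual, audio)  # visual attends to audio
visual_guided_audio = fuse(audio, visual)  # audio attends to visual
print(audio_guided_visual.shape)  # torch.Size([2, 10, 256])
```

Note that this vanilla formulation exhibits exactly the two limitations the abstract points out: the audio and visual inputs come from separately encoded spaces, and attention is computed over strictly aligned time steps with no mechanism for temporal asynchrony between the streams.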
Audio-visual event detection aims to identify semantically defined events that reveal human activiti...
Visual attention modeling is a very active research field and several image an...
Our experience of the world is multimodal - we see objects, hear sounds, and read texts to perceive ...
This work aims to temporally localize events that are both audible and visible in video. Previous me...
In human multi-modality perception systems, the benefits of integrating auditory and visual informat...
This paper is the first to explore audio-visual event localization in an unsupervised manner. Pr...
We investigate the weakly-supervised audio-visual video parsing task, which aims to parse a video in...
Audiovisual (AV) representation learning is an important task from the perspec...
Explaining the decision of a multi-modal decision-maker requires determining the evidence from both...
Temporal action localization aims at localizing action instances from untrimmed videos. Existing wor...
Exploiting the multimodal and temporal interaction between audio-visual channels is essential for au...
Models based on diverse attention mechanisms have recently shone in tasks related to acoustic event...
Emotions play a crucial role in human-to-human communication and have a complex socio-psychological nature. ...
People can easily imagine the potential sound while seeing an event. This natural synchronization be...
Visual events are usually accompanied by sounds in our daily lives. However, can machines learn ...