In human multi-modality perception systems, integrating auditory and visual information is highly beneficial, as the two modalities provide plentiful supplementary cues for understanding events. Although some recent methods have been proposed for this task, they cannot handle practical conditions involving temporal inconsistency. Inspired by the human system, which focuses on specific locations, time segments, and media during multi-modality perception, we propose an attention-based method to simulate this process. Like the human mechanism, our network can adaptively select "where" to attend, "when" to attend, and "which" to attend for audio-visual event localization. In this way, even with large temporal inconsistency between visio...
Everyday experience involves the continuous integration of information from multiple sensory inputs....
This paper for the first time explores audio-visual event localization in an unsupervised manner. Pr...
People can easily imagine the potential sound while seeing an event. This natural synchronization be...
Localizing the audio-visual events in video requires a combined judgment of visual and audio compone...
This work aims to temporally localize events that are both audible and visible in video. Previous me...
Explaining the decision of a multi-modal decision-maker requires determining the evidence from both...
Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn ...
Temporal moment localization (TML) aims to retrieve the temporal interval for a moment semantically ...
Several studies on cross-modal attention showed that remapping processes between sensory modalities ...
Abstract. We introduce a computational model of sensor fusion based on the topographic representatio...
Humans show a remarkable perceptual ability to select the speech stream of interest among multiple c...
Convergence of multisensory information can improve the likelihood of detecting and responding to an...
Cognitive processes, including those important in psychophysics, arise from the coordinated activity...
The aim of the experiments reported in this thesis was to investigate the multisensory interactions ...
Perception of auditory events is inherently multimodal, relying on both audio and visual cues. A larg...