The visual and audio modalities of a video are symbiotic: they carry both common and complementary information. If the two are mined and fused sufficiently, the performance of related video tasks can be significantly enhanced. However, because of environmental interference or sensor faults, sometimes only one modality is available while the other is abandoned or missing. Recovering the missing modality from the existing one, based on the common information shared between them and the prior information of the specific modality, can benefit various vision tasks. In this paper, we propose a Cross-Modal Cycle Generative Adversarial Network (CMCGAN) to handle cross-modal visual-audio mutual generation. Specifical...
Synthesizing audio-reactive videos to accompany music is a challenging multi-domain task that requires...
Detecting complex video events based on audio and visual modalities is still a largely unresolved is...
Video captioning refers to generating a descriptive sentence for a specific short video clip automatica...
Cross-modal generation is playing an important role in translating information between different dat...
In this paper our objectives are, first, networks that can embed audio and visual inputs into a comm...
People can easily imagine the potential sound while seeing an event. This natural synchronization be...
In human multi-modality perception systems, the benefits of integrating auditory and visual informat...
We propose a novel deep training algorithm for joint representation of audio and visual information ...
From recalling long forgotten experiences based on a familiar scent or on a piece of music, to lip r...
Learning a joint embedding space for various modalities is of vital importance for multimodal fusion. ...
Audio-visual emotion recognition is the study of identifying human emotional states by combining ...
In many domains, such as artificial intelligence, computer vision, speech, and bioinformatics, featu...
In recent years, there have been numerous developments toward solving multimodal tasks, aiming to le...
© 1979-2012 IEEE. People can recognize scenes across many different modalities beyond natural images...
Paper presented at: 18th International Society for Music Information Retrieval Conference (ISM...