Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Because the two tasks have different characteristics, architectures have usually been designed independently for each one. As a result, the representations learned by such models tend to be task-specific, which inevitably limits the generalization ability of features based on multi-modal modeling. More recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Therefore, as a motivation to bridge the multi-modal associations in audio-visual task...
In this paper we introduce MCA-NMF, a computational model of the acquisition of multi-modal concepts...
In this paper, we propose two techniques, namely joint modeling and data augmentation, to improve sy...
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multipl...
Recent advances in the field of statistical learning have established that learners are able to trac...
An audio-visual speech enhancement system is regarded as one of the promising solutions for isolating an...
With the advance in self-supervised learning for audio and visual modalities, it has become possible...
That we perceive our environment as a unified scene rather than individual streams of auditory, visu...
doi: 10.3389/fnhum.2014.00829 Multisensory training can promote or impede visual perceptual learning...
We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual e...
One view of speech perception is that acoustic signals are transformed into representations for patt...
We are grateful to Grace Yeni-Komshian, Ken Grant, Christian Lorenzi and Alain de Cheveigné for in...
Language is an integral part of human interpersonal communication, which is conveyed through multipl...
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like ...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...