Although speech is a simple and effective way for humans to communicate with the outside world, realistic speech interaction also carries multimodal information, e.g., vision and text. How to design a unified framework that integrates information from different modalities and leverages different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework. The proposed VATLM employs a unified backbone network to model modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech,...
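The design described above can be sketched minimally: three lightweight modality-dependent preprocessors project visual, speech, and text inputs into a shared embedding space, and a single modality-independent backbone then processes all of them with the same parameters. All names, dimensions, and the linear-projection choice below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a VATLM-style layout (illustrative, not the
# paper's implementation): per-modality preprocessing into a shared
# space, followed by one shared backbone.
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared embedding dimension (assumed value)

# Modality-dependent preprocessors: simple linear projections here.
W_visual = rng.standard_normal((32, D))   # e.g., lip-region features -> D
W_speech = rng.standard_normal((40, D))   # e.g., filterbank frames   -> D
W_text   = rng.standard_normal((100, D))  # e.g., one-hot tokens      -> D

def preprocess(x, W):
    """Project raw modality features into the shared embedding space."""
    return x @ W

def backbone(h):
    """Modality-independent backbone (a stand-in for a Transformer):
    the same fixed weights are applied regardless of input modality."""
    W_shared = np.ones((D, D)) / D  # fixed shared weights for illustration
    return np.tanh(h @ W_shared)

# Three sequences of different lengths and raw feature sizes.
visual = preprocess(rng.standard_normal((5, 32)), W_visual)
speech = preprocess(rng.standard_normal((7, 40)), W_speech)
text   = preprocess(rng.standard_normal((9, 100)), W_text)

outputs = [backbone(h) for h in (visual, speech, text)]
print([o.shape for o in outputs])  # each sequence keeps its length, dim D
```

The point of the sketch is the asymmetry: the preprocessors differ per modality, while the backbone is identical for all three, so representations from visual-audio pairs, audio-text pairs, and unimodal data all flow through one set of shared parameters.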
In this paper, we study how to use masked signal modeling in vision and language (V+L) representatio...
Audio-visual speech recognition (AVSR) has gained remarkable success in ameliorating the noise-robu...
Multimodal pre-training for audio-and-text has recently been proven effective and has signific...
While audio-visual speech models can yield superior performance and robustness compared to audio-onl...
How to boost speech pre-training with textual data is an unsolved problem due to the fact that speec...
We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...
This paper investigates self-supervised pre-training for audio-visual speaker representation learnin...
With the advance in self-supervised learning for audio and visual modalities, it has become possible...
Self-supervised learning (SSL) achieves great success in speech recognition, while limited explorati...
We present Maestro, a self-supervised training method to unify representations learnt from speech an...
Can we leverage the audiovisual information already present in video to improve self-supervised repr...
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correc...
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and ...
Recently, speech representation learning has improved many speech-related tasks such as speech recog...