We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric with respect to the two modalities' pretext tasks: whereas the auditory stream predicts both the visual and auditory targets, the visual one predicts only the auditory targets. We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained. Notably, RAVEn s...
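The mechanics described above can be sketched in a toy form. This is a hypothetical illustration, not the authors' implementation: the "encoders" are single weight matrices standing in for Transformer encoders, the momentum (EMA) update produces the slowly-evolving target encoders, and the losses encode the asymmetry (audio predicts both modalities' targets; video predicts only the auditory target).

```python
import numpy as np

def ema_update(student, teacher, m=0.999):
    """Slowly-evolving momentum (EMA) update of a target encoder."""
    return {k: m * teacher[k] + (1.0 - m) * student[k] for k in teacher}

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
d = 8
video, audio = rng.standard_normal((2, d))  # toy per-frame features

# Toy student encoders (weight matrices) and their momentum copies,
# which generate the contextualised prediction targets.
enc_v = {"w": rng.standard_normal((d, d))}
enc_a = {"w": rng.standard_normal((d, d))}
tgt_v = {"w": enc_v["w"].copy()}
tgt_a = {"w": enc_a["w"].copy()}

# Targets come from the momentum encoders applied to the *unmasked* inputs.
target_v = tgt_v["w"] @ video
target_a = tgt_a["w"] @ audio

# Students see masked inputs (here: half the dimensions zeroed out).
mask = np.arange(d) < d // 2
feat_v = enc_v["w"] @ (video * mask)
feat_a = enc_a["w"] @ (audio * mask)

# Asymmetric objective: audio predicts both targets, video only audio's.
loss_audio = mse(feat_a, target_v) + mse(feat_a, target_a)
loss_video = mse(feat_v, target_a)
loss = loss_audio + loss_video

# After a (hypothetical) optimiser step on the students,
# the momentum encoders follow them slowly.
tgt_v = ema_update(enc_v, tgt_v, m=0.999)
tgt_a = ema_update(enc_a, tgt_a, m=0.999)
```

In a realistic setup, lightweight predictor heads would map student features into the target spaces before computing the losses; they are omitted here for brevity.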
Human perception and learning are inherently multimodal: we interface with the world through multipl...
Learning rich visual representations using contrastive self-supervised learning has been extremely s...
Deep learning has demonstrated impressive results for tasks where the training of neural networks ca...
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is p...
With the advance in self-supervised learning for audio and visual modalities, it has become possible...
Methods for extracting audio and speech features have been studied since pioneering work on spectrum...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...
This paper investigates self-supervised pre-training for audio-visual speaker representation learnin...
Although speech is a simple and effective way for humans to communicate with the outside world, a mo...
Can we leverage the audiovisual information already present in video to improve self-supervised repr...
Self-supervised representation learning has recently attracted a lot of research interest for both t...
Decades of research in acoustic speech recognition have led to systems that we use in our everyday l...
Deep learning has fueled an explosion of applications, yet training deep neural networks usually req...
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a s...
Deep neural networks trained with supervised learning algorithms on large amounts of labeled speech ...