Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with visual information, which is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup, so progress was limited by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest ava...
© 2016 IEEE. Automatic speech recognition (ASR) has become a widespread and convenient mode of human-...
Automatic speech recognition (ASR) has shown rapid advances in recent years but still degrades signi...
In recent research, in the domain of speech processing, large End-to-End (E2E) systems for Automatic...
This paper investigates self-supervised pre-training for audio-visual speaker representation learnin...
We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech...
Audio-visual speech recognition (AVSR) has gained remarkable success for ameliorating the noise-robu...
Decades of research in acoustic speech recognition have led to systems that we use in our everyday l...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...
Speech is the most important tool of interaction among human beings. This has inspired researchers t...
Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations i...
Advances in self-supervised learning have significantly reduced the amount of transcribed audio requ...
Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR)...
Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging ...
Can we leverage the audiovisual information already present in video to improve self-supervised repr...
Methods for extracting audio and speech features have been studied since pioneering work on spectrum...