Automatic speech recognition can potentially benefit from lip motion patterns, which complement acoustic speech and improve overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations that increase recognition accuracy in both clean and noisy conditions. We test our strategy on the TCD-TIMIT and LRS2 datasets, designed for large-vocabulary continuous speech recognition, applying three types of noise at different power ratios. We also exploit state-of-the-art sequence-to-sequence architectures, showing that our method can be easily integrated. Res...
In pattern recognition one usually relies on measuring a set of informative features to perform task...
With the advance in self-supervised learning for audio and visual modalities, it has become possible...
In this paper, we address the problem of automatic discovery of speech patterns using audio-visual i...
We present recent work on improving the performance of automated speech recognizers by using additio...
163 p. Thesis (Ph.D.), University of Illinois at Urbana-Champaign, 2000. Computer technologies have im...
Abstract — Visual speech information from the speaker’s mouth region has been successfully shown to ...
Humans are often able to compensate for noise degradation and uncertainty in speech information by a...
This paper proposes a new method for bimodal information fusion in audio-visual speech recognition, ...
The use of visual features in the form of lip movements to improve the performance of acoustic speec...
Extending automatic speech recognition (ASR) to the visual modality has been shown to greatly increa...
Automatic speech recognition (ASR) permits effective interaction between humans and machines in envi...
Human perception and learning are inherently multimodal: we interface with the world through multipl...
A major goal of current speech recognition research is to improve the robustness of recognition syst...