In this paper, we analyzed how audio-visual speech enhancement can help perform the ASR task in a cocktail party scenario. To this end, we considered two simple end-to-end LSTM-based models that perform single-channel audio-visual speech enhancement and phone recognition, respectively. We then studied how the two models interact and how training them jointly affects the final result. We analyzed different training strategies, which revealed some interesting and unexpected behaviors. The experiments show that while the ASR task is being optimized, the speech enhancement capability of the model decreases significantly, and vice versa. Nevertheless, the joint optimization of the two tasks yields a remarkable drop in the Phone Error Rate (PER) compared t...
Distant-microphone automatic speech recognition (ASR) remains a challenging go...
Long Short-Term Memory (LSTM) recurrent neural network has proven effective in modeling speech and ...
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied t...
In this paper, we analyzed how audio-visual speech enhancement can help to perform the ASR task in a...
Automatic speech recognition (ASR) refers to the task of extracting a transcription of the linguisti...
Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of ...
Automatic speech recognition (ASR) permits effective interaction between humans and machines in envi...
Multi-microphone signal processing techniques have the potential to greatly im...
Human speech processing is often a multimodal process combining audio and visual processing. Eyes a...
The use of visual features in the form of lip movements to improve the performance of acoustic speec...
In this paper, we address the problem of enhancing the speech of a speaker of interest in a cocktai...
Speaker-attributed automatic speech recognition (SA-ASR) in multiparty meeting scenarios is one of t...
In automatic speech recognition systems (ASRs), training is a critical phase to the system's success...
Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberati...