Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashio...
This paper investigates four single-channel speech dereverberation algorithms, i.e., two unsupervise...
Many speech technologies, such as automatic speech recognition and speaker identification, are conve...
Extraction of a target speech signal from the convolutive mixture of multiple sources observed in a ...
Humans are skilled in selectively extracting a single sound source in the presence of multiple simul...
This paper proposes a neural network based system for multi-channel speech enhancement and dereverbe...
In speech communication systems, such as voice-controlled systems, hands-free mobile telephones, and...
Multichannel blind source separation performances rapidly degrade when the mixtures are highly rever...
This paper examines the performance of several source separation systems on a speech separation task...
We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition ...
Acoustic reverberation arises from the reflection of sound waves within an enclosed space. It is gen...
Multichannel blind source separation performances rapidly degrade when the m...
This paper examines the performance of several source separation systems on a speech separa...
Despite the recent progress of automatic speech recognition (ASR) driven by deep learning, conversat...
Multi-microphone signal processing techniques have the potential to greatly im...
In real world environments, the speech signals received by our ears are usually a combination of dif...