The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end, we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence alone or in combination with visual informa...
Recent studies show that facial information contained in visual speech can be helpful for the perfor...
Language is an integral part of human interpersonal communication, which is conveyed through multipl...
The visual modality, deemed to be complementary to the audio modality, has recently been ex...
The work of Bernstein and Benoît has confirmed that it is advantageous to use multiple senses, for e...
Speech separation is the task of segregating a target speech signal from background interference. To...
Humans are skilled in selectively extracting a single sound source in the presence of multiple simul...
Human listeners have the extraordinary ability to hear and recognize speech even when more than one ...
Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberati...
Despite the recent progress of automatic speech recognition (ASR) driven by deep learning, conversat...
In this paper we investigate the problem of integrating the complementary audio and visual modalitie...
This paper addresses a method of multichannel signal separation (MSS) with its application to cockta...
A technique for the early fusion of visual lip movements and a vector of mixed speech signals is pro...
This data repository contains separated audio streams for the LibriCSS dataset using the fo...