This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on two recent, independently developed technologies: array-geometry-agnostic continuous speech separation (VarArray) and streaming multi-talker ASR based on token-level serialized output training (t-SOT). To combine the best of both technologies, we design a new t-SOT-based ASR model that generates a serialized multi-talker transcription from the two separated speech signals produced by VarArray. We also propose a pre-training scheme for such an ASR model where we simulate VarArray's output signals based on monaural si...
This paper describes noisy speech recognition for an augmented reality headset that helps verbal com...
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer...
When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems s...
This paper proposes a token-level serialized output training (t-SOT), a novel framework for streamin...
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that ...
Although recent advances in deep learning technology have boosted automatic speech recognition (ASR)...
In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which ...
In voice-enabled domestic or meeting environments, distributed microphone arrays aim to process dist...
During conversations, humans are capable of inferring the intention of the speaker at any point of t...
There is growing interest in unifying the streaming and full-context automatic speech recognition (A...
End-to-end formulation of automatic speech recognition (ASR) and speech translation (ST) makes it ea...
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied t...
Streaming recognition and segmentation of multi-party conversations with overlapping speech is cruci...
The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Aut...
Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we in...