Speaker-attributed automatic speech recognition (SA-ASR) in multiparty meeting scenarios is one of the most valuable and challenging ASR tasks. It has been shown that three approaches can partially solve this problem: single-channel frame-level diarization with serialized output training (SC-FD-SOT), single-channel word-level diarization with SOT (SC-WD-SOT), and joint training of single-channel target-speaker separation and ASR (SC-TS-ASR). SC-FD-SOT obtains speaker-attributed transcriptions by aligning the speaker diarization results with the ASR hypotheses; SC-WD-SOT uses word-level diarization to remove the alignment's dependence on timestamps; and SC-TS-ASR jointly trains target-speaker separation and ASR modules, which achieves the best ...
This paper describes noisy speech recognition for an augmented reality headset that helps verbal com...
Self-supervised learning (SSL) methods which learn representations of data without explicit supervis...
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied t...
In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA...
When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems s...
Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of ...
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that ...
In voice-enabled domestic or meeting environments, distributed microphone arrays aim to process dist...
Recently, the end-to-end training approach for multi-channel ASR has shown its effectiveness, which ...
In this paper, we analyzed how audio-visual speech enhancement can help to perform the ASR task in a...
In this paper, we present a novel system for joint speaker identification and speech separation. For...
This paper proposes a token-level serialized output training (t-SOT), a novel framework for streamin...
Due to the high performance of multi-channel speech processing, we can use the outputs from a multi-...
Automatic speech recognition (ASR) refers to the task of extracting a transcription of the linguisti...
Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberati...