We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit an end-of-segment (EOS) signal in real-time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a novel dummy frame injection strategy allows for simultaneous high-quality 2nd pass results and low finalization latency. On a ...
The thesis is a replication of the work by Takaaki Hori and his colleagues (2019), which introduces ...
Currently, there are mainly three Transformer encoder based streaming End to End (E2E) Automatic Spe...
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SL...
Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours ...
Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to id...
During conversations, humans are capable of inferring the intention of the speaker at any point of t...
This paper presents an in-depth study on a Sequentially Sampled Chunk Conformer, SSC-Conformer, for ...
The Automated Speech Recognition (ASR) community experiences a major turning point with the rise of ...
[EN] The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Aut...
Although recent advances in deep learning technology have boosted automatic speech recognition (ASR)...
Recently, the unified streaming and non-streaming two-pass (U2/U2++) end-to-end model for speech rec...
Conformers have recently been proposed as a promising modelling approach for automatic speech recogn...
Optimization of modern ASR architectures is among the highest priority tasks since it saves many com...
This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-d...
Learning a set of tasks in sequence remains a challenge for artificial neural networks, which, in su...