Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock"). We propose to replace the VAD with an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion, allowing the segmentation decision to be conditioned not only on be...
Although recent advances in deep learning technology improved automatic speech recognition (ASR), it...
Speech segmentation, which splits long speech into short segments, is essential for speech translati...
Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR)...
We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key...
Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to id...
The vast majority of ASR research uses corpora in which both the training and test data have been pr...
Speech segmentation is the problem of finding the end points of a speech utterance for passing to an...
Streaming recognition and segmentation of multi-party conversations with overlapping speech is cruci...
Appropriate turn-taking is an important issue in spoken dialogue systems. Especially in one...
The Automated Speech Recognition (ASR) community experiences a major turning point with the rise of ...
Speech translation models are unable to directly process long audios, like TED talks, which have to ...
Segmentation for continuous Automatic Speech Recognition (ASR) has traditionally used silence timeou...
Currently, there are mainly three Transformer encoder based streaming End to End (E2E) Automatic Spe...
Direct speech-to-text translation (ST) models are usually trained on corpora segmented at sentence l...
This study addresses robust automatic speech recognition (ASR) by introducing a Conformer-based acou...