In this work, we define barge-in verification as a supervised learning task where audio-only information is used to classify user spoken dialogue into true and false barge-ins. Following the success of pre-trained models, we use low-level speech representations from a self-supervised representation learning model for our downstream classification task. Further, we propose a novel technique to infuse lexical information directly into speech representations to improve the domain-specific language information implicitly learned during pre-training. Experiments conducted on spoken dialog data show that our proposed model trained to validate barge-in entirely from speech representations is faster by 38% relative and achieves 4.5% relative F1 sco...
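To make the classification setup above concrete, the following is a minimal sketch, not the authors' model. It assumes that frame-level speech representations are mean-pooled into one vector per utterance and passed to a logistic-regression classifier that separates true from false barge-ins. The pooling choice, the classifier, the feature dimension, and the synthetic stand-in features are all assumptions made for illustration; in practice the features would come from a pre-trained self-supervised speech encoder, and the lexical infusion step is not shown.

```python
import math
import random


def mean_pool(frames):
    """Average frame-level vectors (T x D) into one utterance-level vector (D)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    """Probability that the utterance is a true barge-in (label 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logreg(X, y, lr=0.1, epochs=50):
    """Fit logistic regression with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            g = score(w, b, x) - t  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    return 1 if score(w, b, x) >= 0.5 else 0


def synthetic_utterance(rng, label, dim=8):
    """Hypothetical stand-in for encoder output: Gaussian frames whose mean
    depends on the label. Real features would come from a speech encoder."""
    n_frames = rng.randint(20, 40)
    shift = 1.0 if label == 1 else -1.0
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n_frames)]


rng = random.Random(0)
labels = [i % 2 for i in range(200)]  # 1 = true barge-in, 0 = false barge-in
features = [mean_pool(synthetic_utterance(rng, t)) for t in labels]

# Train on the first 150 utterances, evaluate on the remaining 50.
w, b = train_logreg(features[:150], labels[:150])
held_out = list(zip(features[150:], labels[150:]))
accuracy = sum(predict(w, b, x) == t for x, t in held_out) / len(held_out)
```

The synthetic classes are well separated by construction, so the held-out accuracy here says nothing about performance on real spoken dialog data; the sketch only shows the data flow from frame-level representations to a binary barge-in decision.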
In this study, we investigate the process of generating single-sentence representations for the purp...
Deep neural networks trained with supervised learning algorithms on large amounts of labeled speech ...
Inducing semantic representations directly from speech signals is a highly challenging task but has ...
As with human-human interaction, spoken human-computer dialog will contain situations where there is...
Speaker recognition, recognizing speaker identities based on voice alone, enables important downstre...
An ideal spoken dialogue system listens continually and determines which utterances were spoken to i...
We use machine learners trained on a combination of acoustic confidence and pragmatic plausibility f...
Dialogue promises a natural and effective method for users to interact with and obtain information f...
Automatic speech recognizers (ASR) typically treat each utterance of a conversation independently. T...
This paper describes the incorporation of contextual information into spoken dialogue systems in the...
What role do linguistic cues on a surface and contextual level have in identifying the intention beh...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...
While named entity recognition (NER) from speech has been around as long as NER from written text ha...
This paper describes a way of using intonation and dialog context to improve the performance o...
Although there have been remarkable advances in dialogue systems through the dialogue systems techno...