In this paper we present an integrated unsupervised method for producing a quality corpus for training automatic speech recognition (ASR) systems using prompts or closed captions. Closed captions and prompts do not always have timestamps and do not necessarily correspond to the exact speech. We propose a method for extracting a quality corpus from imperfect transcripts. The proposed approach works in two steps. During the search, the ASR system finds matching segments in a large prompt database. Matching segments are then used inside a Driven Decoding Algorithm (DDA) to produce a high-quality corpus. Results show an F-measure of 96% in terms of spotting, while the DDA corrects the output according to the prompts: a high q...
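The search step and the spotting metric described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function names `best_prompt` and `f_measure`, the word-level similarity from Python's `difflib`, and the `threshold` value are all assumptions chosen for illustration, and the Driven Decoding Algorithm itself is not reproduced here.

```python
from difflib import SequenceMatcher


def best_prompt(hypothesis, prompts, threshold=0.6):
    """Return (index, score) of the prompt closest to an ASR hypothesis.

    Hypothetical helper: compares word sequences with difflib and returns
    None when no prompt reaches the (assumed) similarity threshold.
    """
    hyp_words = hypothesis.lower().split()
    best = None
    for i, prompt in enumerate(prompts):
        # ratio() is 2 * matches / total number of words in both sequences
        score = SequenceMatcher(None, hyp_words, prompt.lower().split()).ratio()
        if best is None or score > best[1]:
            best = (i, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy prompt database and a noisy ASR hypothesis (illustrative data only).
prompts = [
    "the weather will be sunny tomorrow",
    "stock markets fell sharply today",
]
match = best_prompt("the weather will be sunny to morrow", prompts)
```

In this toy setting a segment counts as "spotted" when `best_prompt` returns a match, and `f_measure` would then be computed from counts of correct, spurious, and missed spottings against a reference alignment.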
The newest generation of speech technology has caused a huge increase in audio-visual data nowadays bein...
We investigate the problem of predicting the quality of automatic speech recognition (ASR) output ...
In this paper, we present CEASR, a Corpus for Evaluating ASR quality. It is a data set derived from ...
In many cases, textual information can be associated with speech signals such ...
This paper addresses the problem of using journalist prompts or closed captions to build corpora for...
We describe an efficient procedure for automatic repair of quickly transcribed (QT) speech. QT speec...
The increased availability of broadband connections has recently led to an increase in the use of In...
The paper addresses a scheme of lightly supervised training of an acoustic model, which exploits a l...
We address the problem of estimating the quality of Automatic Speech Recognition (ASR) output at utt...
Large-scale spontaneous speech corpora are a crucial resource for various domains of spoken language p...
This paper compares schemes for the selection of multi-genre broadcast data and corresponding transc...
Automatic speech recognition (ASR) in the educational environment could be a solution to address the...
We present a test corpus of audio recordings and transcriptions of presentations of students' enterp...
In the last decade automated captioning services have appeared in mainstream technology use. Until n...
LREC 2006: the 5th International Conference on Language Resources and Evaluation, May 2006. This paper...