Direct speech-to-text translation (ST) models are usually trained on corpora segmented at sentence level, but at inference time they are commonly fed audio split by a voice activity detector (VAD). Since VAD segmentation is not syntax-informed, the resulting segments do not necessarily correspond to well-formed sentences uttered by the speaker but, most likely, to fragments of one or more sentences. This segmentation mismatch considerably degrades the quality of ST models’ output. So far, researchers have focused on improving audio segmentation to produce sentence-like splits. In this paper, instead, we address the issue in the model, making it more robust to a different, potentially sub-optimal segmentation. To this aim, we tra...
Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours ...
Subtitles, in order to achieve their purpose of transmitting information, need to be easily readable...
The recognition of speech involves the segmentation of continuous utterances into their component wo...
The audio segmentation mismatch between training data and those seen at run-time is a major problem ...
This paper describes FBK’s system submission to the IWSLT 2021 Offline Speech Translation task. We p...
Speech segmentation is the problem of finding the end points of a speech utterance for passing to an...
For real-life applications, it is crucial that end-to-end spoken language translation models perform...
Speech segmentation, which splits long speech into short segments, is essential for speech translati...
Document-level contextual information has shown benefits to text-based machine translation, but whet...
Segmentation methods are an essential part of the simultaneous machine translation process because, ...
Article under review at Interspeech 2022. Speech translation models are unable to directly proc...
Automatic sentence segmentation of speech is important for enriching speech recognition output and a...
Recent studies on direct speech translation show continuous improvements by means of data augmentati...
We explore the use of prosodic features beyond pauses, including duration, pitch, and energy feature...
This paper presents experiments on sentence boundary detection in transcripts of spoken dialogues. S...