In this paper, we introduce our work on building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into text in a target language. The backbone of SM2 is the Transformer Transducer, which has high streaming capability. Instead of human-labeled speech translation (ST) data, SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve ST quality comparable to, or even better than, that of some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has truly zero-shot capability.
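The weak supervision described above can be illustrated with a short sketch: each ASR utterance keeps its original audio, while its transcription is passed through a machine translation system to produce a pseudo speech-translation target. The snippet below is a minimal illustration of that idea rather than the authors' pipeline; the MarianMT checkpoint, the `AsrExample` record, and the output format are assumptions standing in for the unspecified machine translation service and corpus schema.

```python
# Minimal sketch of weakly supervised ST data generation: pair each ASR
# utterance's audio with a machine-translated version of its transcription.
# Assumption: a Hugging Face MarianMT checkpoint stands in for the paper's
# "machine translation service"; AsrExample is a hypothetical record type.
from dataclasses import dataclass
from transformers import MarianMTModel, MarianTokenizer


@dataclass
class AsrExample:
    audio_path: str      # path to the speech waveform
    transcription: str   # human-labeled transcript in the source language


def build_pseudo_st_pairs(examples, src_lang="de", tgt_lang="en"):
    """Turn ASR examples into (audio, translated-text) pairs for ST training."""
    model_name = f"Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    pairs = []
    for ex in examples:
        batch = tokenizer([ex.transcription], return_tensors="pt", padding=True)
        generated = model.generate(**batch)
        translation = tokenizer.decode(generated[0], skip_special_tokens=True)
        # The audio is unchanged; only the target text is machine-generated,
        # which is why the supervision is "weak" rather than human-labeled.
        pairs.append({"audio_path": ex.audio_path, "target_text": translation})
    return pairs
```

In the paper's setting, the translations come from a machine translation service applied to speech recognition corpora across 25 languages; the Marian checkpoint above only stands in for that service to keep the sketch self-contained.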