End-to-end formulation of automatic speech recognition (ASR) and speech translation (ST) makes it easy to use a single model for both multilingual ASR and many-to-many ST. In this paper, we propose streaming language-agnostic multilingual speech recognition and translation using neural transducers (LAMASSU). To enable multilingual text generation in LAMASSU, we conduct a systematic comparison between specified and unified prediction and joint networks. We leverage a language-agnostic multilingual encoder that substantially outperforms shared encoders. To enhance LAMASSU, we propose feeding the target language identification (LID) to the encoders. We also apply connectionist temporal classification (CTC) regularization to transducer training. Experimental results show that LAMASSU...
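One way to picture feeding the target LID to an encoder is to append a language vector to every input frame before it reaches the encoder. The following is a minimal sketch under stated assumptions: the one-hot encoding, the per-frame concatenation, the language list, and the function names are all hypothetical illustrations and are not taken from the abstract, which does not say how the LID is injected.

```python
from typing import List

# Hypothetical language inventory; the abstract does not list the languages.
LANGUAGES = ["en", "de", "zh"]


def one_hot_lid(target_lang: str, languages: List[str] = LANGUAGES) -> List[float]:
    """Return a one-hot vector marking the target language."""
    if target_lang not in languages:
        raise ValueError(f"unknown target language: {target_lang}")
    return [1.0 if lang == target_lang else 0.0 for lang in languages]


def append_target_lid(frames: List[List[float]], target_lang: str) -> List[List[float]]:
    """Concatenate the target-LID vector to every acoustic feature frame."""
    lid = one_hot_lid(target_lang)
    return [frame + lid for frame in frames]


# Two toy 2-dimensional frames standing in for acoustic features.
frames = [[0.1, 0.2], [0.3, 0.4]]
conditioned = append_target_lid(frames, "de")
```

Under this assumption the same audio can be routed to different output languages by changing only the appended vector, so the encoder input dimension grows by the number of supported languages.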
Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech...
This paper describes the University of Helsinki Language Technology group’s participation in the IWS...
Simultaneous translation systems start producing the output while processing the partial source sent...
Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we in...
In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which ...
Humans benefit from communication but suffer from language barriers. Machine translation (MT) aims t...
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech r...
In this article, we propose a simple yet effective approach to train an end-to-end speech recognitio...
This paper proposes a token-level serialized output training (t-SOT), a novel framework for streamin...
The recent development of neural network-based automatic speech recognition (ASR) systems has greatl...
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied t...
[EN] The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Aut...
Nowadays, training end-to-end neural models for spoken language translation (SLT) still has to confr...
As one of the most popular sequence-to-sequence modeling approaches for speech recognition, the RNN-...