We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-doma...
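The random-projection quantization mentioned above turns continuous speech features into discrete pre-training targets without learning the quantizer. A minimal sketch of the idea, assuming nothing beyond the abstract (dimensions, codebook size, and the cosine-similarity lookup are illustrative choices, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not from the paper).
feat_dim, proj_dim, codebook_size = 80, 16, 4096

# Both the projection matrix and the codebook are random and frozen:
# they are never updated during pre-training.
projection = rng.normal(size=(feat_dim, proj_dim))
codebook = rng.normal(size=(codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map speech feature frames to discrete codebook indices.

    frames: (num_frames, feat_dim) array, e.g. log-mel features.
    Returns (num_frames,) integer labels that serve as BERT-style
    prediction targets for masked frames during pre-training.
    """
    projected = frames @ projection
    projected /= np.linalg.norm(projected, axis=1, keepdims=True)
    # Nearest codebook entry by cosine similarity.
    return np.argmax(projected @ codebook.T, axis=1)

labels = quantize(rng.normal(size=(100, feat_dim)))
```

Because the quantizer is frozen, the only trainable component is the encoder itself, which keeps the self-supervised objective simple and cheap at this data scale.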
In this work, we study the impact of Large-scale Language Models (LLM) on Automated Speech Recogniti...
End-to-end formulation of automatic speech recognition (ASR) and speech translation (ST) makes it ea...
We present a method for cross-lingual training of an ASR system using absolutely no transcribed trainin...
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models ...
In this paper, we investigate the usage of large language models (LLMs) to improve the performance o...
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied t...
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined fr...
While Automatic Speech Recognition (ASR) models have shown significant advances with the introductio...
Self-supervised learning (SSL) achieves great success in speech recognition, while limited explorati...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...
In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which ...
Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme re...
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning fr...
In this paper, we introduce a massively multilingual speech corpus with fine-grained phonemic trans...
Advances in self-supervised learning have significantly reduced the amount of transcribed audio requ...