Multimodal audio-and-text pre-training has recently proved effective and has significantly improved the performance of many downstream speech understanding tasks. However, these state-of-the-art pre-trained audio-text models work well only when provided with large amounts of parallel audio-and-text data, which poses challenges for many languages that are rich in unimodal corpora but scarce in parallel cross-modal corpora. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which is able to rec...
Recent breakthroughs in deep learning often rely on representation learning and knowledge transfer. ...
Training deep neural network based Automatic Speech Recognition (ASR) models often requires thousand...
The development of a speech recognition system requires at least three resources: a large labeled sp...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...
How to boost speech pre-training with textual data is an unsolved problem due to the fact that speec...
While audio-visual speech models can yield superior performance and robustness compared to audio-onl...
Although speech is a simple and effective way for humans to communicate with the outside world, a mo...
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NM...
Self-supervised learning from raw speech has been proven beneficial to improve...
Deep neural networks trained with supervised learning algorithms on large amounts of labeled speech ...
Many of today's state-of-the-art automatic speech recognition (ASR) systems are based on hybrid hidd...
Pre-trained speech Transformers have facilitated great success across various speech processing task...
This paper investigates the potential of improving a hybrid automatic speech recognition model train...