End-to-end Automatic Speech Recognition (ASR) models are usually trained to optimize the loss over the whole token sequence, while neglecting explicit supervision at the phoneme granularity. This can result in recognition errors caused by confusion between similar phonemes or by phoneme reduction. To alleviate this problem, we propose a novel framework based on Supervised Contrastive Learning (SCaLa) to enhance phonemic representation learning for end-to-end ASR systems. Specifically, we extend the self-supervised Masked Contrastive Predictive Coding (MCPC) to a fully-supervised setting, where the supervision is applied in the following way. First, SCaLa masks variable-length encoder features according to phoneme boundaries given phoneme forced-alignment extrac...
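The two steps that are visible in the abstract can be sketched in a minimal form: mask encoder frames segment-by-segment using phoneme boundaries from a forced alignment, then score masked-position predictions against their true features with an InfoNCE-style contrastive loss. The code below is a hedged illustration under stated assumptions, not the SCaLa implementation: the shapes, the `mask_by_phoneme` and `info_nce` helpers, and the use of noisy targets as stand-in predictions are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_by_phoneme(features, boundaries, mask_prob=0.3):
    """Zero out encoder frames for randomly selected phoneme segments.

    features:   (T, D) encoder outputs
    boundaries: list of (start, end) frame indices from forced alignment
    Returns the masked features and a boolean mask over frames.
    """
    masked = features.copy()
    frame_mask = np.zeros(len(features), dtype=bool)
    for start, end in boundaries:
        if rng.random() < mask_prob:   # mask the whole phoneme span at once
            masked[start:end] = 0.0
            frame_mask[start:end] = True
    return masked, frame_mask

def info_nce(pred, target, temperature=0.1):
    """InfoNCE-style contrastive loss: each predicted frame should be most
    similar to its own target frame among all masked target frames."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = pred @ target.T / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # positives on the diagonal

# Toy example: 20 frames, 8-dim features, three phoneme segments.
feats = rng.normal(size=(20, 8))
segs = [(0, 6), (6, 13), (13, 20)]
masked, m = mask_by_phoneme(feats, segs, mask_prob=1.0)  # mask everything for the demo
# Stand-in "predictions": the true features plus small noise.
preds = feats[m] + rng.normal(scale=0.1, size=(int(m.sum()), 8))
loss = info_nce(preds, feats[m])
```

The key difference from self-supervised MCPC is only where the mask spans come from: here they align with supervised phoneme boundaries, so each masked span covers exactly one phonetic unit rather than a fixed-length window.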
We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of repres...
This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-d...
To extract robust deep representations from long sequential modeling of speech data, we propose a se...
Self-supervised pre-training methods based on contrastive learning or regression tasks can utilize m...
Traditionally, research in automated speech recognition has focused on local-first encoding of audio...
Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance ...
Speech Emotion Recognition (SER) is a challenging task due to limited data and blurred boundaries of...
Automatic speech recognition (ASR) has shown rapid advances in recent years but still degrades signi...
Speech is the surface form of a finite set of phonetic units, which can be represented by discrete c...
Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR)...
Self-supervised learning from raw speech has been proven beneficial to improve...
Self-supervised learning (SSL) has shown tremendous success in various speech-related downstream tas...
Advances in self-supervised learning have significantly reduced the amount of transcribed audio requ...
In recent years, speech-based self-supervised learning (SSL) has made significant progress in variou...
Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme re...