In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, fine-tuning strategies for adapting these pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and a learning rate schedule that stabilize the fine-tuning process and further boost performance. Multi-head factorized attentive pooling is proposed to factorize the comparison of speaker representations into multiple phonetic clusters. We regularize towards the parameters of the pre-trained model and we set different learning rates for each layer of...
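The two stabilization ideas named in the abstract above, regularizing towards the pre-trained parameters and using a different learning rate per layer, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the squared-distance form of the penalty, and the geometric decay of learning rates from the top layer downwards are all assumptions made for illustration, since the abstract is truncated before it gives those details.

```python
from typing import Dict, List


def pretrained_distance_penalty(
    params: Dict[str, List[float]],
    pretrained: Dict[str, List[float]],
    strength: float,
) -> float:
    """Penalty that grows as fine-tuned weights drift from the pre-trained ones.

    Assumed form: 0.5 * strength * sum of squared differences.
    """
    total = 0.0
    for name, values in params.items():
        reference = pretrained[name]
        total += sum((w - w0) ** 2 for w, w0 in zip(values, reference))
    return 0.5 * strength * total


def layerwise_learning_rates(
    num_layers: int, top_lr: float, decay: float
) -> List[float]:
    """One learning rate per layer; index 0 is the lowest layer.

    Assumed schedule: the top layer gets top_lr and each layer below it
    is scaled by `decay`, so lower layers change more slowly.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


if __name__ == "__main__":
    # Toy values only; real models hold tensors, not short lists.
    pretrained = {"layer0": [0.0, 2.0], "layer1": [1.0]}
    current = {"layer0": [1.0, 2.0], "layer1": [1.0]}
    print(pretrained_distance_penalty(current, pretrained, strength=2.0))
    print(layerwise_learning_rates(num_layers=3, top_lr=1.0, decay=0.5))
```

In a training loop, the penalty would be added to the task loss, and each learning rate would be assigned to the optimizer parameter group of the corresponding layer.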
Methods for extracting audio and speech features have been studied since pioneering work on spectrum...
In recent years, the development of accurate deep keyword spotting (KWS) models has resulted in KWS ...
Speaker identification systems in a real-world scenario are tasked to identify a speaker amongst a s...
State-of-the-art speaker verification systems are inherently dependent on some kind of human supervi...
Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme re...
Speaker recognition, recognizing speaker identities based on voice alone, enables important downstre...
Voice cloning is a difficult task which requires robust and informative features incorporated in a h...
Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work on a short utterance...
This paper presents the SJTU system for both text-dependent and text-independent tasks in short-dura...
Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance ...
Self-supervised learning (SSL) achieves great success in speech recognition, while limited explorati...
This paper explores three novel approaches to improve the performance of speaker verification (SV) s...
For self-supervised speaker verification, the quality of pseudo labels decides the upper bound of th...
While Automatic Speech Recognition (ASR) models have shown significant advances with the introductio...