Voice cloning is a difficult task that requires robust, informative features incorporated into a high-quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which has been shown to produce high-quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to help the resulting features capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, ...
We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of repres...
Advances in self-supervised learning have significantly reduced the amount of transcribed audio requ...
In this paper, we propose an end-to-end text-to-speech system deployment wherein a user feeds input ...
Methods for extracting audio and speech features have been studied since pioneering work on spectrum...
State-of-the-art speaker verification systems are inherently dependent on some kind of human supervi...
The recent advances in text-to-speech have been awe-inspiring, to the point of synthesizing near-hum...
Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme re...
In recent years, the self-supervised learning paradigm has received extensive attention due to its great...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
Speaker recognition, recognizing speaker identities based on voice alone, enables important downstre...
Existing singing voice synthesis (SVS) models are usually trained on singing data and depend on eith...
While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditi...
This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressiv...
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and pro...
Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time and...