Voice conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research topic because it enables applications such as voice customization and animation production. Recent work in this area has made progress with disentanglement methods that separate utterance content and speaker characteristics in speech recordings. However, many of these methods suffer from prosody leakage (e.g., pitch, volume), which causes the voice in the synthesized speech to differ from that of the desired target speaker. To address this issue, we propose a novel self-supervised approach that learns disentangled pitch and volume representations capturing the prosody styles of different speakers. W...
Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme re...
Methods for extracting audio and speech features have been studied since pioneering work on spectrum...
Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time and...
We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre o...
While most research into speech synthesis has focused on synthesizing high-quality speech for in-dat...
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation...
Human speech can be characterized by different components, including semantic content, speaker ident...
This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressiv...
In the realm of expressive Text-to-Speech (TTS), explicit prosodic boundaries significantly advance ...
Zero-shot voice conversion is becoming an increasingly popular research direction, as it promises th...
Better disentanglement of speech representation is essential to improve the quality of voice convers...
Recently, end-to-end neural audio/speech coding has shown great potential to outperform tradition...
This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic v...
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and pro...
Most people who have tried to learn a foreign language would have experienced difficulties understan...