Streaming voice conversion (VC) is the task of converting one person's voice into another's in real time. Previous streaming VC methods use phonetic posteriorgrams (PPGs) extracted from automatic speech recognition (ASR) systems to represent speaker-independent information. However, PPGs lack the prosody and vocalization information of the source speaker, and streaming PPGs contain undesired leaked timbre of the source speaker. In this paper, we propose using intermediate bottleneck features (IBFs) in place of PPGs. VC systems trained with IBFs retain more prosody and vocalization information of the source speaker. Furthermore, we propose a non-streaming teacher guidance (TG) framework that addresses the timbre leakage problem. Experiments...
Background sound is an informative form of art that is helpful in providing a more immersive experie...
This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker ...
Voice conversion (VC) is a technique to transform a speaker identity included in a source speech wav...
Voice conversion for highly expressive speech is challenging. Current approaches struggle with the b...
This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressiv...
Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research top...
Data augmentation via voice conversion (VC) has been successfully applied to low-resource expressive...
Voice cloning is a difficult task which requires robust and informative features incorporated in a h...
There is growing interest in unifying the streaming and full-context automatic speech recognition (A...
This paper introduces a novel voice conversion (VC) model, guided by text instructions such as "arti...
Voice conversion (VC) transforms an utterance to sound like another person without changing the ling...
Better disentanglement of speech representation is essential to improve the quality of voice convers...
In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which ...
Voice conversion (VC) is a technique to transform a speaker identity included in a source speech wav...
Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance ...