While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditional inputs, it still leaves scope for richer representations. As a part of this work, we leverage representations from various Self-Supervised Learning (SSL) models to enhance the quality of the synthesized speech. In particular, we pass the FastSpeech2 encoder's length-regulated outputs through a series of encoder layers with the objective of reconstructing the SSL representations. In the SALTTS-parallel implementation, the representations from this second encoder are used for an auxiliary reconstruction loss with the SSL features. The SALTTS-cascade implementation, however, passes these representations through the decoder in addition to ha...
Large-scale speech self-supervised learning (SSL) has emerged to the main field of speech processing...
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voi...
Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, the...
Recent years have witnessed great strides in self-supervised learning (SSL) on the speech processing...
Self-supervised learning (SSL) representation for speech has achieved state-of-the-art (SOTA) perfor...
In this paper, we propose a new Self-Supervised Learning (SSL) algorithm called data2vec-aqc, for sp...
The mainstream neural text-to-speech(TTS) pipeline is a cascade system, including an acoustic model(...
Modern speech enhancement (SE) networks typically implement noise suppression through time-frequency...
Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme re...
Self-supervised speech models have grown fast during the past few years and have proven feasible for...
Speech-to-speech translation (S2ST) converts input speech to speech in another language. A challenge...
Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec...
Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech...
Voice cloning is a difficult task which requires robust and informative features incorporated in a h...
Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec...
Large-scale speech self-supervised learning (SSL) has emerged to the main field of speech processing...
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voi...
Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, the...
Recent years have witnessed great strides in self-supervised learning (SSL) on the speech processing...
Self-supervised learning (SSL) representation for speech has achieved state-of-the-art (SOTA) perfor...
In this paper, we propose a new Self-Supervised Learning (SSL) algorithm called data2vec-aqc, for sp...
The mainstream neural text-to-speech(TTS) pipeline is a cascade system, including an acoustic model(...
Modern speech enhancement (SE) networks typically implement noise suppression through time-frequency...
Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme re...
Self-supervised speech models have grown fast during the past few years and have proven feasible for...
Speech-to-speech translation (S2ST) converts input speech to speech in another language. A challenge...
Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec...
Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech...
Voice cloning is a difficult task which requires robust and informative features incorporated in a h...
Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec...
Large-scale speech self-supervised learning (SSL) has emerged to the main field of speech processing...
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voi...
Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, the...