Current state-of-the-art text-to-speech models do not generate raw waveforms directly. Instead, they generate variations of mel-frequency representations, which a separately trained audio vocoder then converts into a raw waveform. This thesis studied two hypotheses. First, we studied whether neural discrete representations can be learned from raw speech waveforms using Vector Quantized Variational AutoEncoders. The results show that the model learns neural discrete representations that can be used for speech compression with high speech quality. We report a Perceptual Evaluation of Speech Quality (PESQ) score of 2.8 for our model, which indicates speech quality comparable to or higher than that of recent neural vocoders in the literature. W...
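The discrete-representation step that a Vector Quantized Variational AutoEncoder relies on can be illustrated with a minimal sketch: each continuous latent vector is replaced by its nearest entry in a codebook, and the entry's index is the discrete code. This is not the thesis implementation; the function name `quantize`, the toy codebook, and the 2-dimensional latents are assumptions chosen for illustration only, and encoder, decoder, and training losses are omitted.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents:  array of shape (num_latents, dim)  (hypothetical encoder outputs)
    codebook: array of shape (num_codes, dim)    (hypothetical learned codes)
    Returns the chosen indices and the corresponding codebook vectors.
    """
    # Pairwise differences via broadcasting: (num_latents, num_codes, dim)
    diffs = latents[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distances: (num_latents, num_codes)
    distances = np.sum(diffs ** 2, axis=-1)
    # The discrete code of each latent is the index of its nearest entry
    indices = np.argmin(distances, axis=1)
    return indices, codebook[indices]


# Toy data, made up for this example
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.9, 1.2], [-0.1, 0.1], [-0.8, 0.7]])

indices, quantized = quantize(latents, codebook)
print(indices)    # discrete codes, one per latent vector
print(quantized)  # the codebook vectors that replace the latents
```

Only the integer indices need to be stored or transmitted, which is what makes such representations usable for compression: with a codebook of K entries, each latent vector costs about log2(K) bits.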
The use of the mel spectrogram as a signal parameterization for voice generation is quite recent and...
The accuracy of automatic speech recognizers has been constantly improving for decades. Aalto Univer...
Most speech synthesis systems require a linguistic module to produce the features that drive the spe...
Speech recognition systems generally need a large quantity of highly variable voice and recording co...
Despite the recent successes of neural networks in a variety of domains, musical audio modeling is s...
During the 2000s, unit-selection-based text-to-speech was the dominant commercial technology....
This thesis demonstrates the state-of-the-art technologies in text-to-speech synthesis for the Finni...
Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-lev...
The public defense on 5th June 2020 at 12:00 will be organized via remote technology. Link: https:/...
We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture using a split vec...
Modeling humans’ speech is a challenging task that originally required a coalition between phonetici...
This work focuses on single-word speech recognition, where the end goal is to accurately recognize a...
Vector Quantized Variational AutoEncoders (VQ-VAE) are a powerful representation learning framework ...
This study searched for a speech command recognition model that could be trained with a small amount...
Recent advances in deep learning have enabled certain systems to approach or even achieve human pari...