In this paper, we present an expressive visual text-to-speech (VTTS) system based on a deep neural network (DNN). Given an input text sentence and a set of expression tags, the VTTS produces not only the audio speech but also the accompanying facial movements. The target expression can be either one of the expressions in the training corpus or a blend of them. Furthermore, we present a method of adapting a previously trained DNN to include a new expression using a small amount of training data. Experiments show that the proposed DNN-based VTTS is preferred by 57.9% over the baseline hidden Markov model (HMM) based VTTS, which uses cluster adaptive training.
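As a rough illustration of how such a system could condition synthesis on an expression tag or a blend of tags, the following minimal sketch (an assumption, not the authors' implementation) shows a feed-forward DNN whose per-frame input is a concatenation of linguistic features and an expression control vector; blending is expressed as a convex combination of the one-hot codes of the training expressions. All names and dimensions (ExpressiveVTTS, N_LING, N_EXPR, N_ACOUSTIC, N_VISUAL) are hypothetical.

```python
# Hedged sketch of expression-conditioned audio-visual synthesis (hypothetical sizes).
import torch
import torch.nn as nn

N_LING = 300      # linguistic/context features per frame (assumed size)
N_EXPR = 6        # number of expressions in the training corpus (assumed)
N_ACOUSTIC = 60   # e.g. spectral + F0 parameters per frame (assumed)
N_VISUAL = 40     # e.g. facial shape/appearance coefficients per frame (assumed)

class ExpressiveVTTS(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_LING + N_EXPR, 1024), nn.Tanh(),
            nn.Linear(1024, 1024), nn.Tanh(),
            nn.Linear(1024, N_ACOUSTIC + N_VISUAL),
        )

    def forward(self, ling_feats, expr_weights):
        # ling_feats:   (frames, N_LING) per-frame linguistic features
        # expr_weights: (N_EXPR,) one-hot for a single expression, or any
        #               convex combination of one-hot codes to blend expressions
        expr = expr_weights.expand(ling_feats.size(0), -1)
        out = self.net(torch.cat([ling_feats, expr], dim=-1))
        return out[:, :N_ACOUSTIC], out[:, N_ACOUSTIC:]

model = ExpressiveVTTS()
frames = torch.randn(100, N_LING)                      # dummy linguistic input
blend = torch.tensor([0.0, 0.7, 0.0, 0.3, 0.0, 0.0])   # illustrative 70/30 blend of two expressions
acoustic_params, visual_params = model(frames, blend)
```

Under this reading, adapting to a new expression with little data would amount to extending the control vector and fine-tuning only a small part of the network, but that detail is not specified by the abstract.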