Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) with raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g reference image or one-hot encoding). Our model is trained in a self-supervised approach by exploiting the audio and visual signals naturally aligned in videos. With the purpose of training from video data, we present a novel dataset collected for this work, with high-quality vid...
We present a method for generating a video of a talking face. The method takes as inputs: (i) still ...
Speech-driven facial animation is the process which uses speech signals to automatically synthesize ...
Generative adversarial networks (GANs) synthesize realistic samples (image, audio, video, etc.) from...
Speech is a rich biometric signal that contains information about the identity, gender and emotional...
Speech is a rich biometric signal that contains information about the identity, gender and emotional...
Speech is a rich biometric signal that contains information about the identity, gender and emotional...
This work addresses the problem of mapping fixed representations (embeddings) of a speech signal to ...
This paper presents a simple method for speech videos generation based on audio: given a piece of au...
While several research studies have focused on analyzing human behavior and, in particular, emotiona...
While several research studies have focused on analyzing human behavior and, in particular, emotiona...
We describe a method for generating a video of a talking face. The method takes still images of the ...
We propose the enhancement and evaluation of a deep neural network that is trained from scratch in a...
The objective of this paper is a neural network model that controls the pose and expression of a giv...
Generating synthesized images, being able to animate or transform them somehow, has lately been expe...
Talking face generation aims to synthesize a sequence of face images that correspond to a clip of sp...
We present a method for generating a video of a talking face. The method takes as inputs: (i) still ...
Speech-driven facial animation is the process which uses speech signals to automatically synthesize ...
Generative adversarial networks (GANs) synthesize realistic samples (image, audio, video, etc.) from...
Speech is a rich biometric signal that contains information about the identity, gender and emotional...
Speech is a rich biometric signal that contains information about the identity, gender and emotional...
Speech is a rich biometric signal that contains information about the identity, gender and emotional...
This work addresses the problem of mapping fixed representations (embeddings) of a speech signal to ...
This paper presents a simple method for speech videos generation based on audio: given a piece of au...
While several research studies have focused on analyzing human behavior and, in particular, emotiona...
While several research studies have focused on analyzing human behavior and, in particular, emotiona...
We describe a method for generating a video of a talking face. The method takes still images of the ...
We propose the enhancement and evaluation of a deep neural network that is trained from scratch in a...
The objective of this paper is a neural network model that controls the pose and expression of a giv...
Generating synthesized images, being able to animate or transform them somehow, has lately been expe...
Talking face generation aims to synthesize a sequence of face images that correspond to a clip of sp...
We present a method for generating a video of a talking face. The method takes as inputs: (i) still ...
Speech-driven facial animation is the process which uses speech signals to automatically synthesize ...
Generative adversarial networks (GANs) synthesize realistic samples (image, audio, video, etc.) from...