In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel multistage fusion method in which the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage method in which the streams are merged at a single point. Both methods rely on extracting summary linguistic embeddings from a pre-trained BERT model and using them to condition one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baselin...
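The difference between single-stage and multistage fusion can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration: the class `ToyFusionNet`, the helper functions, the use of dense layers in place of convolutional blocks, plain vectors in place of log-Mel spectrograms and BERT embeddings, and additive conditioning as the fusion operator (the abstract does not state how the conditioning is applied).

```python
import random


def linear(x, w):
    """Multiply vector x by weight matrix w (a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(x):
    return [max(0.0, v) for v in x]


def init(rng, rows, cols):
    """Small random weight matrix; stands in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyFusionNet:
    """Hypothetical stand-in for an audio network conditioned on text.

    Each "stage" is a dense layer standing in for a convolutional block.
    `fuse_at` lists the stage indices whose output is conditioned on the
    text embedding: one index gives single-stage fusion, several give
    multistage fusion.
    """

    def __init__(self, audio_dim, text_dim, hidden_dim, n_stages, fuse_at, seed=0):
        rng = random.Random(seed)
        dims = [audio_dim] + [hidden_dim] * n_stages
        self.stage_w = [init(rng, dims[i + 1], dims[i]) for i in range(n_stages)]
        # One text projection per fused stage (assumed additive conditioning).
        self.text_w = {i: init(rng, hidden_dim, text_dim) for i in sorted(fuse_at)}

    def forward(self, audio, text):
        h = audio
        for i, w in enumerate(self.stage_w):
            h = relu(linear(h, w))
            if i in self.text_w:
                # Project the text embedding and add it to the audio features.
                cond = linear(text, self.text_w[i])
                h = [a + c for a, c in zip(h, cond)]
        return h


audio = [0.1, 0.2, 0.3, 0.4]   # toy stand-in for pooled spectrogram features
text = [1.0, 0.0, 0.0]         # toy stand-in for a summary text embedding

single_stage = ToyFusionNet(4, 3, 5, n_stages=3, fuse_at={2})
multistage = ToyFusionNet(4, 3, 5, n_stages=3, fuse_at={0, 1, 2})

print(len(single_stage.text_w), len(multistage.text_w))
```

In this sketch the only difference between the two variants is the set of stages passed as `fuse_at`; a late-fusion baseline would instead combine the outputs of two separate networks.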
Emotion recognition from speech may play a crucial role in many applications related to human–comput...
This paper describes a revealing robust spectral feature for speech emotion recognition using Deep N...
The redundant information, noise data generated in the process of single-modal feature extraction, a...
Introduction. The effective fusion of text and audio information for categorical and dimensional spe...
Deep learning has emerged as a powerful alternative to hand-crafted methods for emotion recognition ...
Automatic speech emotion recognition (SER) by a computer is a critical component for more natural hu...
In this paper we present a Convolutional Neural Network for multilingual emotion recognition from sp...
There is an apparent evolving interest in speech emotion recognition (SER), one of the particular c...
Automatic affect recognition is a challenging task due to the various modalities emotions ...
Speech emotion recognition (SER) is a challenging task since it is unclear what kind of features are...
The curse of dimensionality is a well-established phenomenon. However, the properties of high dimens...
Speech emotion recognition (SER) is a challenging task since it is unclear what kind of features are...
We present our system description of input-level multimodal fusion of audio, video, and text for recog...
Emotion recognition has become one of the most researched subjects in the scientific community, espe...
Speech is an efficient means of expressing attitude and emotions via language. The crucial task for th...