Self-supervised speech models have advanced rapidly over the past few years and have proven feasible for use in various downstream tasks. Some recent work has started to examine the characteristics of these models, yet many questions have not been fully addressed. In this work, we conduct a study on emotional corpora to explore a popular self-supervised model, wav2vec 2.0. Through a set of quantitative analyses, we demonstrate that: 1) wav2vec 2.0 appears to discard paralinguistic information that is less useful for word recognition; 2) for emotion recognition, representations from the middle layer alone perform as well as those derived from layer averaging, while the final layer yields the worst performance in some cases; 3) c...
Speech Emotion Recognition (SER) plays a pivotal role in enhancing human-computer interaction by ena...
Recent advances with self-supervised learning have allowed speech recognition systems to achieve sta...
Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases....
Human emotion understanding is pivotal in making conversational technology mainstream. We view speec...
Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently ...
Self-supervised learning has recently been widely adopted in speech processing, replacing ...
Self-supervised pre-training could effectively improve the performance of low-resource automatic spe...
Self-supervised representation learning (SSRL) has improved the performance on downstream phoneme re...
Speech Emotion Recognition (SER) is a challenging task due to limited data and blurred boundaries of...
Recent years have witnessed great strides in self-supervised learning (SSL) on the speech processing...
Self-supervised speech recognition models require considerable labeled training data for learning hi...
Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR)...
Self-supervised learning (SSL) achieves great success in speech recognition, while limited explorati...
Advances in self-supervised learning have significantly reduced the amount of transcribed audio requ...