Unsupervised speech disentanglement aims to separate the fast-varying from the slowly varying components of a speech signal. In this contribution, we take a closer look at the embedding vector representing the slowly varying signal components, commonly called the speaker embedding vector. We ask which properties of a speaker's voice are captured, and we use the concept of Shapley values to investigate to what extent individual embedding vector components are responsible for them. Our findings show that certain speaker-specific acoustic-phonetic properties can be predicted fairly well from the speaker embedding, while the more abstract voice quality features we investigated cannot.
Comment: Presented at the ITG conference on Speech Communication 202
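To illustrate the attribution idea mentioned in the abstract, the following is a minimal sketch of exact Shapley values computed by enumerating subsets of embedding components. It is not the authors' implementation: the function names `shapley_values` and `toy_predictor`, the three-dimensional "embedding", and the all-zero baseline are hypothetical choices made only for this example. Exact enumeration scales exponentially with the number of components, so for real speaker embeddings an approximation would presumably be needed.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each component of x for the scalar function predict.

    Components outside a coalition are replaced by the corresponding
    baseline value (an assumption of this sketch).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude component i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_predictor(e):
    # Hypothetical stand-in for a model that predicts a voice property
    # from an embedding: two linear terms plus one interaction term.
    return 2.0 * e[0] - 1.0 * e[1] + 0.5 * e[0] * e[2]


embedding = [1.0, 2.0, 4.0]   # hypothetical embedding vector
baseline = [0.0, 0.0, 0.0]    # hypothetical reference embedding
phi = shapley_values(toy_predictor, embedding, baseline)
```

In this toy setting the attributions sum to the difference between the prediction for `embedding` and the prediction for `baseline`, and the interaction term is split evenly between the two components involved in it.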
Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel...
Speech events are unique; speakers do not produce the same sound in exactly the same way twice. They...
The ability to perceive sounds as words involves a transformation from detailed speech signals to in...
Disentanglement is the task of learning representations that identify and separate factors that expl...
Speaker embeddings represent a means to extract representative vectorial representations from a spee...
Speaker verification (SV) is a task to verify a claimed identity from the voice signal. A well-perfo...
Speech is a signal that includes speaker's emotion, characteristic specification, phoneme-informatio...
State-of-the-art speaker verification systems are inherently dependent on some kind of human supervi...
Speech perception is an extremely difficult perceptual task that people do effortlessly. It requires...
The traditional approach to speech perception has relied on the assumption that speech is structure...
The performance of the automatic speaker recognition system is becoming more and more accurate, with...
Speech intelligibility assessment plays an important role in the therapy of patients suffering from ...
While promising performance for speaker verification has been achieved by deep speaker embeddings, t...
Speaker verification techniques neglect the short-time variation in the feature space even though it...