In this paper, we present a multimodal speech recognition system for real-world scene description tasks. Given a visual scene, the system dynamically biases its language model based on the content of the scene and the visual attention of the speaker, which is used to focus on likely objects within the scene. Given a spoken description, the system then uses the visually biased language model to process the speech. Head pose serves as a proxy for the speaker's visual attention. Readily available, standard computer vision algorithms recognize the objects in the scene, and automatic real-time head pose estimation uses depth data captured via a Microsoft Kinect. The system was evaluated on multiple p...
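The abstract does not say how the language model is biased, so the following is only a minimal sketch of one plausible approach: linear interpolation between a base unigram model and a distribution over detected object labels, weighted by how much attention each object receives. The function name `bias_unigram_lm`, the `bias_weight` parameter, and the attention scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict


def bias_unigram_lm(
    base_lm: Dict[str, float],
    object_attention: Dict[str, float],
    bias_weight: float = 0.3,
) -> Dict[str, float]:
    """Interpolate a base unigram LM with an attention-weighted object distribution.

    base_lm:          word -> probability (assumed to sum to 1).
    object_attention: detected object label -> non-negative attention score,
                      e.g. derived from head pose (hypothetical input).
    bias_weight:      share of probability mass given to the visual distribution.
    """
    if not 0.0 <= bias_weight <= 1.0:
        raise ValueError("bias_weight must lie in [0, 1]")

    total_attention = sum(object_attention.values())
    if total_attention <= 0:
        # No usable visual evidence: fall back to the unbiased model.
        return dict(base_lm)

    # Normalise attention scores into a probability distribution over labels.
    visual = {w: a / total_attention for w, a in object_attention.items()}

    # Mix the two distributions; words missing from one side contribute zero.
    vocab = set(base_lm) | set(visual)
    return {
        w: (1.0 - bias_weight) * base_lm.get(w, 0.0)
        + bias_weight * visual.get(w, 0.0)
        for w in vocab
    }
```

Because both inputs are normalised, the interpolated model remains a valid probability distribution, and words naming attended objects gain probability mass at the expense of the rest of the vocabulary. A real recognizer would more likely bias an n-gram or neural language model, which this sketch does not attempt.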
Biometrics has been a topic of great interest since the advent of the information age and will soon ...
This paper presents an Active Appearance Model (AAM) based multiple camera visual speech rec...
Visual speech cues are known to improve the performance of automatic speech recognition (ASR). Howev...
This thesis describes how an automatic lip reader was realized. Visual speech recognition is a preco...
In this paper we study the adaptation of visual and audio-visual speech recognition systems to non-i...
In audio-visual automatic speech recognition (AVASR), no research to date has been conducted into th...
A Visually Grounded Speech model is a neural model which is trained to embed image caption pairs clo...
Automatic speechreading systems use both acoustic and visual signals to perform speech recognition. ...
In this paper we build on our recent work, where we successfully incorporated facial depth data of a...
Spoken dialog systems have matured to the point where the underlying technologies of speech recognit...
In this paper an evaluation of visual speech features is performed specifically for the ta...
In automatic lipreading, the speaker's head movement can affect the mouth shape appearing in th...
We investigated word recognition in a Visually Grounded Speech model. The model has been trained on ...
Visual speech information from the speaker’s mouth region has been successfully shown to ...