Human speech processing is often a multimodal process combining audio and visual processing. Eyes and Ears Together proposes two benchmark multimodal speech processing tasks: (1) multimodal automatic speech recognition (ASR) and (2) multimodal co-reference resolution on spoken multimedia. These tasks are motivated by our desire to address the difficulties of ASR for multimedia spoken content. We review prior work on the integration of multimodal signals into speech processing for multimedia data, introduce a multimedia dataset for our proposed tasks, and outline these tasks.
ABSTRACT—Speech perception is inherently multimodal. Visual speech (lip-reading) information is used...
In our natural environment, we simultaneously receive information through various sensory modalities...
Automatic speech recognition (ASR) permits effective interaction between humans and machines in envi...
In most of our everyday conversations, we not only hear but also see each other talk. Our understand...
Abstract — Visual speech information from the speaker’s mouth region has been successfully shown to ...
In the framework of multimedia analysis and interaction, speech and language p...
Current cognitive models of spoken word recognition and comprehension are underspecified with respec...