In the case of unwritten languages, acoustic models cannot be trained in the standard way, i.e., using speech and textual transcriptions. Recently, several methods have been proposed to learn speech representations using images, i.e., using visual grounding. Existing studies have focused on scene images. Here, we investigate whether fine-grained semantic information, reflecting the relationship between attributes and objects, can be learned from spoken language. To this end, a Fine-grained Semantic Embedding Network (FSEN) for learning semantic representations of spoken language grounded by fine-grained images is proposed. For training, we propose an efficient objective function, which includes a matching constraint, an adversarial objectiv...
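The matching constraint mentioned in this abstract pulls paired speech and image embeddings together in a shared space. The following is a minimal PyTorch-style sketch of one common form of such a constraint, a triplet ranking loss over in-batch negatives; SpeechEncoder, ImageEncoder, matching_loss, and all dimensions are hypothetical illustrations rather than the published FSEN code, and the adversarial term is omitted.

    # Hedged sketch: a bi-modal matching constraint (triplet ranking loss)
    # over paired speech and image embeddings. Illustration only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeechEncoder(nn.Module):
        """Toy speech branch: mean-pool log-mel frames, then project."""
        def __init__(self, n_mels=40, dim=512):
            super().__init__()
            self.proj = nn.Linear(n_mels, dim)

        def forward(self, mels):                  # mels: (B, T, n_mels)
            return F.normalize(self.proj(mels.mean(dim=1)), dim=-1)

    class ImageEncoder(nn.Module):
        """Toy image branch: project precomputed CNN features."""
        def __init__(self, feat_dim=2048, dim=512):
            super().__init__()
            self.proj = nn.Linear(feat_dim, dim)

        def forward(self, feats):                 # feats: (B, feat_dim)
            return F.normalize(self.proj(feats), dim=-1)

    def matching_loss(speech_emb, image_emb, margin=0.2):
        """Matched pairs must outscore in-batch mismatches by `margin`."""
        sims = speech_emb @ image_emb.t()         # (B, B) cosine similarities
        pos = sims.diag().unsqueeze(1)            # matched-pair scores
        cost_s = F.relu(margin + sims - pos)      # speech-anchored violations
        cost_i = F.relu(margin + sims - pos.t())  # image-anchored violations
        mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
        return cost_s.masked_fill(mask, 0).mean() + cost_i.masked_fill(mask, 0).mean()

Using the other captions in a batch as negatives keeps the constraint cheap; harder negative mining or an adversarial discriminator, as the abstract suggests, would sit on top of the same two-branch structure.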
In this paper, we explore neural network models that learn to associate segments of spoken audio cap...
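Associating segments of spoken audio with the image regions they refer to is often scored through a frame-by-location similarity volume. The sketch below shows one such computation, sometimes called a matchmap, with max-over-location and mean-over-time pooling; the function names (matchmap, utterance_image_score), tensor shapes, and the pooling choice are assumptions for illustration, not the exact similarity used in the paper.

    # Hedged sketch: score how well audio frames align with image locations.
    import torch

    def matchmap(audio_frames, image_grid):
        """audio_frames: (T, D) frame-level speech embeddings
           image_grid:   (H, W, D) spatial image embeddings
           returns:      (T, H, W) similarity volume"""
        return torch.einsum('td,hwd->thw', audio_frames, image_grid)

    def utterance_image_score(audio_frames, image_grid):
        """Max over locations, mean over time: one scalar per pair,
        rewarding frames that match some region of the image well."""
        m = matchmap(audio_frames, image_grid)    # (T, H, W)
        return m.flatten(1).max(dim=1).values.mean()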
An estimated half of the world’s languages do not have a written form, making it impossible for thes...
Visually grounded speech representation learning has been shown to be useful in the field of speech repre...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Comp...
Text-based technologies, such as text translation from one language to another, and image captioning...
We present a visually grounded model of speech perception which projects spoken utterances and image...
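Once utterances and images are projected into one joint space, cross-modal retrieval reduces to nearest-neighbour search. The sketch below ranks candidate images by similarity to an encoded spoken query; it assumes encoders that return L2-normalised vectors (such as the toy ones sketched earlier), and rank_images with its argument shapes is illustrative only.

    # Hedged sketch: retrieve images for a spoken utterance in a shared space.
    import torch

    @torch.no_grad()
    def rank_images(speech_enc, image_enc, mels, image_feats):
        """mels: (1, T, n_mels) one utterance; image_feats: (N, feat_dim).
        Returns image indices sorted from best to worst match."""
        q = speech_enc(mels)                  # (1, D), assumed L2-normalised
        g = image_enc(image_feats)            # (N, D)
        sims = (q @ g.t()).squeeze(0)         # cosine similarities, (N,)
        return sims.argsort(descending=True)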
This paper explores the possibility to learn a semantically-relevant lexicon from images and speech ...
Humans learn language by interaction with their environment and listening to other humans. It should...
A widespread approach to processing spoken language is to first automatically transcribe it into tex...
We investigated word recognition in a Visually Grounded Speech model. The model has been trained on ...
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Com...
Visual Grounding (VG) is the task of locating a specific object in an image semantically matching a gi...
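In its simplest form, the grounding step selects, from a set of candidate regions, the one whose embedding scores highest against the embedded query. The sketch below illustrates only that selection; ground_query, the dot-product scoring, and the (R, 4) box format are assumptions, not a description of any specific VG system.

    # Hedged sketch: pick the candidate region best matching a query embedding.
    import torch

    def ground_query(query_emb, region_embs, region_boxes):
        """query_emb: (D,); region_embs: (R, D); region_boxes: (R, 4).
        Returns the bounding box of the highest-scoring region."""
        sims = region_embs @ query_emb        # (R,) similarity scores
        return region_boxes[sims.argmax()]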
The language acquisition literature shows that children do not build their lex...