This study addresses the question of whether visually grounded speech recognition (VGS) models learn to capture sentence semantics without access to any prior linguistic knowledge. We produce synthetic and natural spoken versions of a well-known semantic textual similarity database and show that our VGS model produces embeddings that correlate well with human semantic similarity judgements. Our results show that a model trained on a small image-caption database outperforms two models trained on much larger databases, indicating that database size is not all that matters. We also investigate the importance of having multiple captions per image and find that this is indeed helpful even if the total number of images is lower, suggesting that paraphrasing is a valuable addition to the training data.
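As a concrete illustration of how such an evaluation works, the sketch below scores sentence pairs by the cosine similarity of their embeddings and correlates those scores with human similarity ratings, the standard protocol for semantic textual similarity benchmarks. The embed function and all other names are hypothetical placeholders assumed for illustration, not the study's actual code.

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(u, v):
        # Cosine similarity between two embedding vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def sts_correlation(sentence_pairs, human_scores, embed):
        # embed: hypothetical function mapping an utterance to a fixed-size
        # vector, e.g. the sentence encoder of a trained VGS model.
        predicted = [cosine(embed(a), embed(b)) for a, b in sentence_pairs]
        rho, _ = spearmanr(predicted, human_scores)
        return rho

A high rank correlation between the predicted similarities and the human judgements is what indicates that the embeddings capture sentence semantics.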
Spoken versions of the Semantic Textual Similarity dataset for testing semantic sentence-level embeddings...
Previous studies have reported that semantic richness facilitates visual word recognition (see, e.g....
A Visually Grounded Speech model is a neural model that is trained to embed image-caption pairs close together...
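To make that training objective concrete, below is a minimal sketch of the margin-based contrastive (hinge) loss commonly used for models of this kind: embeddings of matching image-caption pairs are pulled together while mismatched pairs from the same batch are pushed apart. The margin value, tensor shapes, and the assumption of L2-normalised embeddings are illustrative choices, not details taken from any particular cited paper.

    import torch

    def contrastive_loss(img_emb, cap_emb, margin=0.2):
        # img_emb, cap_emb: (batch, dim) L2-normalised embeddings, so the
        # dot products below equal cosine similarities.
        scores = cap_emb @ img_emb.t()        # all caption-image similarities
        diagonal = scores.diag().view(-1, 1)  # similarities of matching pairs
        # Hinge cost for ranking the matching image above mismatched ones,
        # and the matching caption above mismatched ones.
        cost_cap = (margin + scores - diagonal).clamp(min=0)
        cost_img = (margin + scores - diagonal.t()).clamp(min=0)
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        return cost_cap.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

Training against such a ranking loss is what makes distances in the shared embedding space meaningful, which the similarity evaluation sketched above exploits.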
Current approaches to learning semantic representations of sentences often use prior word-level knowledge...
Distributional semantic models capture word-level meaning that is useful in many natural language processing tasks...
In the case of unwritten languages, acoustic models cannot be trained in the standard way, i.e., using transcribed speech...
Semantic representation learning for sentences is an important and well-studied problem in NLP. The ...
Humans learn language by interacting with their environment and listening to other humans. It should...
Recent advances in multimodal training use textual descriptions to significantly enhance machine understanding...
Calculating the semantic similarity between sentences is a long-standing problem in the area of natural language processing...
This paper presents a novel approach for automatically generating image descriptions: visual detectors...
In this paper we present a deep learning architecture for extracting word embeddings for visual speech...
We investigated word recognition in a Visually Grounded Speech model. The model has been trained on ...
Recent breakthroughs in NLP research, such as the advent of Transformer models, have indisputably contributed to...