Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018. Cataloged from PDF version of thesis. Includes bibliographical references (pages 145-159). Humans learn language at an early age by simply observing the world around them. Why can't computers do the same? Conventional automatic speech recognition systems have a long history and have recently made great strides thanks to the revival of deep neural networks. However, their reliance on highly supervised (and therefore expensive) training paradigms has restricted their application to the major languages of the world, accounting for a small fraction of the more than 7,000 human languages spoken worldwide. This thesis introduces da...
Automatic speech recognition (ASR) technologies have been successfully applied to most of the majo...
This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models c...
Humans learn language by interaction with their environment and listening to other humans. It should...
In the case of unwritten languages, acoustic models cannot be trained in the standard way, i.e., usi...
Automatic speech recognition has seen recent advancements powered by machine learning, but it is sti...
Text-based technologies, such as text translation from one language to another, and image captioning...
This electronic version was submitted by the student author. The certified thesis is available in th...
Thesis (Ph.D.)--University of Washington, 2022. As humans, our understanding of language is grounded i...
Speech Recognition has become prevalent over the years due to its ability to do information search, ...
This electronic version was submitted by the student author. The certified thesis is available in th...
In this paper, we explore neural network models that learn to associate segments of spoken audio cap...
We investigated word recognition in a Visually Grounded Speech model. The model has been trained on ...
Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences wi...
A widespread approach to processing spoken language is to first automatically transcribe it into tex...