Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 103-107).
In this thesis, I explore state-of-the-art techniques for using neural networks to learn semantically rich representations for visual and audio data. In particular, I analyze and extend the model introduced by Harwath et al. (2016), a neural architecture that learns a non-linear similarity metric between images and audio captions using a sampled margin rank loss. In Chapter 1, I provide a back...
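The objective named in the abstract, a sampled margin rank loss over paired image and audio-caption embeddings, can be summarized with the short sketch below. This is a minimal illustration under assumed details (dot-product similarity, one sampled impostor caption and one impostor image per true pair); the names margin_rank_loss, image_emb, and audio_emb are illustrative and not taken from the thesis or from Harwath et al. (2016).

# Minimal sketch (PyTorch) of a sampled margin rank loss for paired
# image / audio-caption embeddings; the similarity function and sampling
# scheme are assumptions, not the thesis's exact implementation.
import torch

def margin_rank_loss(image_emb, audio_emb, margin=1.0):
    """image_emb, audio_emb: (batch, dim) tensors of matched pairs."""
    batch = image_emb.size(0)
    # Similarity of each true (image, caption) pair.
    pos = (image_emb * audio_emb).sum(dim=1)
    # Sample one impostor caption and one impostor image per pair.
    idx_a = torch.randint(0, batch, (batch,))
    idx_i = torch.randint(0, batch, (batch,))
    neg_caption = (image_emb * audio_emb[idx_a]).sum(dim=1)
    neg_image = (image_emb[idx_i] * audio_emb).sum(dim=1)
    # Hinge terms push each true pair above its impostors by at least `margin`.
    return (torch.clamp(margin + neg_caption - pos, min=0)
            + torch.clamp(margin + neg_image - pos, min=0)).mean()

In practice one would typically avoid sampling an impostor whose index equals the true pair's own index; that refinement is omitted here for brevity.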
We introduce two multimodal neural language models: models of natural language that can be conditio...
In this paper, we propose a multimodal deep learning architecture for emotion recognition in video r...
In this paper, we explore neural network models that learn to associate segments of spoken audio cap...
Visually grounded speech representation learning has shown to be useful in the field of speech repre...
This paper explores the possibility to learn a semantically-relevant lexicon from images and speech ...
Deep learning has fueled an explosion of applications, yet training deep neural networks usually req...
Multimodal reasoning focuses on learning the correlation between different modalities pres...
Speech is a natural way of communicating that does not require us to develop any new skills in order...
Developing intelligent agents that can perceive and understand the rich visual world around us has b...
Deep neural networks have boosted the convergence of multimedia data analytics in a unified framewor...
In the case of unwritten languages, acoustic models cannot be trained in the standard way, i.e., usi...