Developing intelligent agents that can perceive and understand the rich visual world around us has been a long-standing goal in the field of artificial intelligence. In the last few years, significant progress has been made towards this goal and deep learning has been attributed to recent incredible advances in general visual and language understanding. Convolutional neural networks have been used to learn image representations while recurrent neural networks have demonstrated the ability to generate text from visual stimuli. In this thesis, we develop methods and techniques using hybrid convolutional and recurrent neural network architectures that connect visual data and natural language utterances. Towards appreciating these methods, this...
This paper presents the integration of natural language processing and computer vision to improve th...
Large-scale pretrained foundation models have been an emerging paradigm for building artificial inte...
Large-scale pretrained foundation models have been an emerging paradigm for building artificial inte...
Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between ima...
Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between ima...
University of Technology Sydney. Faculty of Engineering and Information Technology.Multi-modal perce...
This thesis focuses on proposing and addressing various tasks in the field of vision and language, a...
This thesis focuses on proposing and addressing various tasks in the field of vision and language, a...
A long standing goal of artificial intelligence is to enable machines to perceive the visual world a...
Humans have an incredible ability to process and understand information from multiple sources such a...
For most people, watching a brief video and describing what happened (in words) is an easy task. For...
For most people, watching a brief video and describing what happened (in words) is an easy task. For...
Although the exponential growth of visual data in various forms, such as images and videos, enables ...
Although the exponential growth of visual data in various forms, such as images and videos, enables ...
International audienceVideo hyperlinking represents a classical example of multimodal problems. Comm...
This paper presents the integration of natural language processing and computer vision to improve th...
Large-scale pretrained foundation models have been an emerging paradigm for building artificial inte...
Large-scale pretrained foundation models have been an emerging paradigm for building artificial inte...
Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between ima...
Throughout the thesis, I demonstrate how each of the proposed methods can bridge the gap between ima...
University of Technology Sydney. Faculty of Engineering and Information Technology.Multi-modal perce...
This thesis focuses on proposing and addressing various tasks in the field of vision and language, a...
This thesis focuses on proposing and addressing various tasks in the field of vision and language, a...
A long standing goal of artificial intelligence is to enable machines to perceive the visual world a...
Humans have an incredible ability to process and understand information from multiple sources such a...
For most people, watching a brief video and describing what happened (in words) is an easy task. For...
For most people, watching a brief video and describing what happened (in words) is an easy task. For...
Although the exponential growth of visual data in various forms, such as images and videos, enables ...
Although the exponential growth of visual data in various forms, such as images and videos, enables ...
International audienceVideo hyperlinking represents a classical example of multimodal problems. Comm...
This paper presents the integration of natural language processing and computer vision to improve th...
Large-scale pretrained foundation models have been an emerging paradigm for building artificial inte...
Large-scale pretrained foundation models have been an emerging paradigm for building artificial inte...