Our experience of the world is multimodal: we see objects, hear sounds, and read text to perceive information. For Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. The heterogeneity of the data brings unique challenges when working with multimodal signals. One such challenge is to identify and understand the alignment between two different modalities. In this dissertation, we focus on learning to align vision and language modalities in static and dynamic tasks in different scenarios. In the first dimension, we address the task of text-based video moment localization. Existing approaches assume that the relevant video is already known/...
Real-world phenomena involve complex interactions between multiple signal moda...
With the current exponential growth of video-based social networks, video retrieval using natural la...
Systems that can find correspondences between multiple modalities, such as between speech and imag...
The correlation between the vision and text is essential for video moment retrieval (VMR), however, ...
2019-01-29. Multimodal reasoning focuses on learning the correlation between different modalities pres...
Video-language pre-training for text-based video retrieval tasks is vitally important. Previous pre-...
In recent years, tremendous success has been achieved in many computer vision tasks using deep learn...
Vision-language alignment learning for video-text retrieval has attracted a lot of attention in recent yea...
This work aims to temporally localize events that are both audible and visible in video. Previous me...
People typically learn through exposure to visual facts associated with linguistic descriptions. For...
Temporal moment localization (TML) aims to retrieve the temporal interval for a moment semantically ...
This paper studies the problem of temporal moment localization in a long untrimmed video using natur...
Localizing the audio-visual events in video requires a combined judgment of visual and audio compone...
In human multi-modality perception systems, the benefits of integrating auditory and visual informat...
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Com...