One of the key factors driving the success of machine learning for scene understanding is the development of data-driven approaches that can automatically extract information from vast amounts of data. Multimodal representation learning has emerged as a demanding area for drawing meaningful information from input data and achieving human-like performance. The challenges in learning representations can be ascribed to the heterogeneity of the available datasets, where the information comes from various modalities or domains, such as visual signals in the form of images and videos or textual signals in the form of sentences. Moreover, one encounters far more unlabeled data in the form of highly multimodal, complex image distributions. I...