A long-standing goal of artificial intelligence is to enable machines to perceive the visual world and interact with humans using natural language. To achieve this goal, many computer vision and natural language processing techniques have been proposed over the past decades, most notably deep convolutional neural networks (CNNs). However, most previous work has focused on the two sides separately, and little work has been done on connecting the vision and language modalities. Hence, the semantic gap between the two modalities still exists. To address this, the overall objective of my PhD research is to design machine learning algorithms for visual content understanding by connecting the vision and language modalities. Towards this goal, we ha...