Learning similarity and distance measures has become increasingly important for the analysis, matching, retrieval, recognition, and categorization of video and multimedia data. With the ubiquitous use of digital imaging devices, mobile terminals, and social networks, massive volumes of heterogeneous and homogeneous video and multimedia data are now available from multiple sources, views, and domains, such as news media websites, microblogs, mobile phones, and social networking services. Similarity- and distance-based constraints can also be extended and incorporated to boost classification and relationship learning. Moreover, the spatio-temporal coherence in video data can be exploited for self-supervised learning of similarity and distance metrics. This ...