The objective of this work is visual recognition of speech and gestures. Solving this problem opens up a host of applications, such as transcribing archival silent films or resolving multi-talker simultaneous speech, but most importantly it helps to advance the state of the art in speech recognition by enabling machines to take advantage of the multi-modal nature of human communication. However, visual recognition of speech and gestures is a challenging problem, in part due to the lack of annotations and datasets, but also due to inter- and intra-personal variations and, in the case of visual speech, ambiguities arising from homophones. Training a deep learning model requires a large amount of data. We propose a method to automa...