Advances in digital technology have increased event recognition capabilities through the development of devices with high resolution, small physical dimensions and high sampling rates. The recognition of complex events in videos has several relevant applications, particularly due to the large availability of digital cameras in environments such as airports, banks, roads, among others. The large amount of data produced is the ideal scenario for the development of automatic methods based on deep learning. Despite the significant progress achieved through image-based deep networks, video understanding still faces challenges in modeling spatio-temporal relations. In this work, we address the problem of human action recognition in videos. A mult...