In this thesis, we address the two problems of tool detection and fine-grained activity recognition in the operating room (OR), which are key ingredients in the development of surgical assistance applications. Leveraging weak supervision for temporal modeling and spatial localization, we propose a joint detection and tracking model for surgical instruments, circumventing the lack of spatially annotated datasets for this task. Towards more helpful AI assistance in the OR, we formalize surgical activities as triplets of ⟨instrument, verb, target⟩ and propose several deep learning methods that leverage instrument activation, spatial attention, and semantic attention mechanisms to recognize these triplets directly from surgical videos. Evaluation is performed on large ...