Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition ...
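The pipeline described above (fine-tune a BERT text classifier on speech/action pairs mined from screenplays, then keep only its confident predictions on unlabelled subtitle segments as weak action labels) can be illustrated with a short sketch. The code below is an assumption-laden illustration, not the authors' released implementation: it uses the Hugging Face transformers API with an untrained bert-base-uncased classification head, and the action vocabulary, the 0.9 confidence threshold, and the helper weak_action_label are hypothetical choices made for the example.

```python
# Minimal sketch (not the authors' code) of the Speech2Action idea:
# a BERT sequence classifier maps a transcribed speech segment to an
# action class; only high-confidence predictions are kept as weak labels.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ACTION_CLASSES = ["phone", "drive", "run", "open door", "dance"]  # illustrative subset
CONFIDENCE_THRESHOLD = 0.9  # hypothetical cutoff for accepting a weak label

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(ACTION_CLASSES)
)
model.eval()

def weak_action_label(speech_segment: str):
    """Return (action, confidence) if the classifier is confident, else None."""
    inputs = tokenizer(
        speech_segment, truncation=True, max_length=128, return_tensors="pt"
    )
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    conf, idx = probs.max(dim=0)
    if conf.item() >= CONFIDENCE_THRESHOLD:
        return ACTION_CLASSES[idx.item()], conf.item()
    return None  # segment contributes no weak label

# Usage: label an unlabelled subtitle line (an untrained head will rarely be confident).
print(weak_action_label("Step on it, we're going to lose them!"))
```

In the paper, clips whose speech receives such a confident prediction are then used as weakly labelled training data for a video action recognition model; the sketch only covers the text-side labelling step.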
In recent years, modern action recognition frameworks with deep architectures have achieved impres...
This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained vide...
At the interface between scene perception and speech production, we investigated how rapidly...
This paper exploits the context of natural dynamic scenes for human action rec...
Recognizing human actions in realistic scenes has emerged as a challenging topic due to various aspe...
The aim of this paper is to address recognition of natural human actions in diverse and realistic vi...
Video action recognition has taken center stage since its introduction in 2004 [SLC04]....
We address recognition and localization of human actions in realistic scenarios. In contrast to the ...
In speech recognition, phonemes have demonstrated their efficacy to model th...
This paper addresses the problem of automatic temporal annotation of realistic human actions in vide...
Recognizing the speech acts in our interlocutors’ utterances is a crucial prerequisite for conversat...
This paper strives for pixel-level segmentation of actors and their actions in video content. Differ...
This work focuses on the recognition of complex human activities in video data. A combination of new fe...