This thesis addresses two visual understanding tasks: visual relationship detection (VRD) and video action recognition. The majority of the thesis is focused on VRD, which is our main contribution. Relations amongst entities play a central role in image and video understanding. In the first three chapters, we discuss visual relationship detection, whose goal is to recognize all (subject, predicate, object) tuples in a given image. Due to the complexity of modeling (subject, predicate, object) relation triplets, it is crucial to develop a method that can not only recognize seen relations, but also generalize to unseen cases. Inspired by a previously proposed visual translation embedding model, or VTransE [1], we propose a context-augmented t...
The ability to accurately understand how humans interact with their surroundings is critical for man...
Inferring the interactions between objects, a.k.a. visual relationship detection, is a crucial point ...
Event models obtained automatically from video can be used in applications ranging from abnormal eve...
Visual Relationship Detection (VRD) is a relatively young research area, where the goal is to develo...
Identifying relations between objects is central to understanding the scene. While several works hav...
Detecting relations among objects is a crucial task for image understanding. However, each relations...
Visual Relationship Detection (VRD) aims to understand real-world objects' interactions by grounding...
Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilitie...
Visual relationship detection aims to completely understand visual scenes and has recently received ...
This thesis contributes to the field of machine learning with a specific focus on the methods for le...
Large scale visual understanding is challenging, as it requires a model to handle the widely-spread ...
The aim of visual relation detection is to provide a comprehensive understanding of an image by desc...
This paper introduces a novel approach for modeling visual relations between p...
Visual relation detection (VRD) aims to describe all interacting objects in an image using subject-p...