The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most existing methods for this caption-to-video retrieval problem do not fully exploit the cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer that jointly encodes the different modalities in video, allowing each of them to attend to the others. The transformer architecture is also leveraged to encode and model temporal information. On the natural language side, we investigate best practices for jointly optimizing the language embedding tog...
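To make the cross-modal attention idea in this abstract concrete, below is a minimal sketch, not the paper's implementation: it assumes PyTorch and pre-extracted per-modality feature sequences, and the class name `MultiModalEncoder` and all hyperparameters are illustrative. Per-modality tokens are tagged with learned modality and temporal embeddings, then concatenated so that self-attention lets every modality attend to every other one.

```python
# Hedged sketch of a multi-modal transformer over video features.
# Assumes PyTorch; all names and sizes here are illustrative.
import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    def __init__(self, dim=512, num_modalities=3, max_len=64,
                 num_layers=4, num_heads=8):
        super().__init__()
        # Learned embeddings mark which modality and which time step each
        # feature came from, so attention can model both cross-modal and
        # temporal structure.
        self.modality_emb = nn.Embedding(num_modalities, dim)
        self.temporal_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, features):
        # features: list of (batch, time, dim) tensors, one per modality,
        # e.g. appearance, audio and motion features.
        tokens = []
        for m, feats in enumerate(features):
            _, t, _ = feats.shape
            pos = torch.arange(t, device=feats.device)
            mod = torch.full((t,), m, device=feats.device)
            tokens.append(feats + self.temporal_emb(pos)
                          + self.modality_emb(mod))
        # Concatenating the sequences lets self-attention run across
        # modalities: each token attends to all others.
        x = torch.cat(tokens, dim=1)
        return self.encoder(x)

# Usage: three modalities with different sequence lengths.
enc = MultiModalEncoder()
video = [torch.randn(2, 30, 512),  # appearance
         torch.randn(2, 20, 512),  # audio
         torch.randn(2, 30, 512)]  # motion
out = enc(video)                   # (2, 80, 512) contextualized tokens
```

The pooled output of such an encoder could then be matched against a text embedding for retrieval; the exact pooling and text-side design are left open here, as the abstract is truncated.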
In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting...
In this report, we present the ReLER@ZJU-Alibaba submission to the Ego4D Natural Language Queries (NLQ)...
We explore methods to enrich the diversity of captions associated with pictures for learning improved...
We present an architecture for learning semantic multi-modal video representations to learn semantic...
With the current exponential growth of video-based social networks, video retrieval using natural language...
In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part...
All multimedia content, but especially video, has in recent years grown in both volume and importance...
Vision-language alignment learning for video-text retrieval has attracted a lot of attention in recent years...
Transformer-based models are widely adopted in multi-modal learning as the cross-attention mechanism...
In this paper we tackle the issue of object instance retrieval in video repositories...
In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on ...
Dense video captioning is the task of localizing interesting events in an untrimmed video and producing...
Dense video captioning aims to localize and describe important events in untrimmed videos. Existing ...
This paper presents a new method for end-to-end Video Question Answering (VideoQA), aside from the c...