The canonical approach to video captioning requires a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaptation to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatio-temporal representations that can adapt to variable lengths of video input without dedicated design for different frame rates...
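As a rough illustration of the patch-as-token idea this abstract describes (a minimal sketch in plain Python, not the authors' implementation; the function name and frame layout are assumptions), each frame is split into non-overlapping patches, and the flattened patches from all frames form one token sequence whose length scales naturally with the number of input frames:

```python
def frames_to_patch_tokens(frames, patch=2):
    """Split each frame into non-overlapping patch x patch blocks and
    flatten them into one token sequence (frame-major, row-major).

    `frames` is a list of H x W nested lists of pixel values; H and W
    must be divisible by `patch`. This is an illustrative sketch of
    patch tokenization, not the SwinBERT code.
    """
    tokens = []
    for frame in frames:
        h, w = len(frame), len(frame[0])
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                # One token = the flattened pixel values of one patch.
                tokens.append([frame[i + di][j + dj]
                               for di in range(patch)
                               for dj in range(patch)])
    return tokens

# Two 4x4 frames -> (4/2) * (4/2) = 4 patches per frame, 8 tokens total;
# a longer clip simply yields proportionally more tokens.
frames = [[[t * 100 + r * 10 + c for c in range(4)] for r in range(4)]
          for t in range(2)]
tokens = frames_to_patch_tokens(frames, patch=2)
```

In the full model, such patch tokens would be linearly embedded and fed to the video transformer; this sketch only shows why no per-frame-rate feature extractor is needed: the token sequence simply grows or shrinks with the clip length.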
Dense video captioning aims to localize and describe important events in untrimmed videos. Existing ...
In the quest to make deep learning systems more capable, a number of more complex, more computationa...
Existing video captioning approaches typically require first sampling video frames from a decoded v...
Transformer-based models are widely adopted in multi-modal learning as the cross-attention mechanism...
Video captioning via encoder–decoder structures is a successful sentence generation method. In addit...
This work demonstrates the implementation and use of an encoder-decoder model to perform a many-to-m...
Recently, the broad adoption of the internet coupled with connected smart devices has seen a signifi...
Automatically generating natural language descriptions for video is an extremely complicated and chal...
This work explores an efficient approach to establish a foundational video-text model for tasks incl...
Dense video captioning aims to identify the events of interest in an input video, and generate descr...
Dense video captioning is an extremely challenging task since an accurate and faithful description o...
Transformer models have shown great success handling long-range interactions, making them a promisin...