We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with th...
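As a rough illustration of the two observations in this abstract, the sketch below shows (a) how tokens on a single-scale grid could be grouped into non-overlapping, non-shifted attention windows and (b) the spatial sizes that result from resampling one feature map to several strides. The function names, the stride-16 input, and the output strides (4, 8, 16, 32) are assumptions chosen for illustration; the excerpt does not give them, and this is not the authors' implementation.

```python
def window_partition(height, width, window):
    """Group token indices of a height x width grid into non-overlapping windows.

    Tokens are numbered row-major. Attention restricted to one returned list
    corresponds to window attention without shifting.
    """
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                row * width + col
                for row in range(top, top + window)
                for col in range(left, left + window)
            ])
    return windows


def pyramid_shapes(height, width, stride=16, out_strides=(4, 8, 16, 32)):
    """Spatial sizes after resampling one stride-`stride` map to each output stride.

    The default strides are hypothetical values, not taken from the abstract.
    """
    return {s: (height * stride // s, width * stride // s) for s in out_strides}


if __name__ == "__main__":
    print(window_partition(4, 4, 2))
    print(pyramid_shapes(64, 64))
```

Because no window ever sees tokens outside itself, a few blocks that mix information across windows are still needed; the abstract refers to these as cross-window propagation blocks.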
Vision Transformer (ViT) has been proposed as a new image recognition method in the field of compute...
Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corrupt...
Object detection, which aims to recognize and locate objects within images using bounding boxes, is ...
We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-traine...
This work presents a simple vision transformer design as a strong baseline for object localization a...
Vision transformers (ViT) have demonstrated impressive performance across numerous machine vision ta...
Various models have been proposed to perform object detection. However, most require many handdesign...
What constitutes an object? This has been a long-standing question in computer vision. Towards this ...
In this paper, we question if self-supervised learning provides new properties...
In this article, a novel real-time object detector called Transformers Only Look Once (TOLO) is prop...
This paper presents a new model for multi-object tracking (MOT) with a transformer. MOT is a spatiot...
Vision Transformers (ViTs) are becoming an increasingly popular and dominant technique for various vision tas...
Modern top-performing object detectors depend heavily on backbone networks, whose advances bring con...
Vision transformers have recently demonstrated great success in various computer vision tasks, motiv...
The use of pretrained deep neural networks represents an attractive alternativ...