This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features. By comparing the design principles of the recent convolutional neural networks ConvNets) and Vision Transformers, we propose to simplify the self-attention by leveraging a convolutional modulation operation. We show that such a simple approach can better take advantage of the large kernels (>=7x7) nested in convolutional layers. We build a family of hierarchical ConvNets using the proposed convolutional modulation, termed Conv2Former. Our network is simple and easy to follow. Experiments show that our Conv2Former outperforms existent popular ConvNets and vision T...
Transformers have become one of the dominant architectures in deep learning, particularly as a power...
Spatial Transformer Networks (STNs) have the potential to dramatically improve performance of convol...
We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent tr...
Vision transformers have shown excellent performance in computer vision tasks. As the computation co...
The successful application of ConvNets and other neural architectures to computer vision is central ...
Vision Transformers have shown great promise recently for many vision tasks due to the insightful ar...
International audienceIn this paper, we question if self-supervised learning provides new properties...
Vision transformers (ViT) have demonstrated impressive performance across numerous machine vision ta...
Visual representation learning is the key of solving various vision problems. Relying on the seminal...
In recent years, convolutional networks have dramatically (re)emerged as the dominant paradigm for s...
Image representation is a key component in visual recognition systems. In visual recognition problem...
Transformers have recently shown superior performances on various vision tasks. The large, sometimes...
Vision Transformer (ViT) attains state-of-the-art performance in visual recognition, and the variant...
Transformers have recently gained significant attention in the computer vision community. However, t...
While convolutional neural networks have shown a tremendous impact on various computer vision tasks,...
Transformers have become one of the dominant architectures in deep learning, particularly as a power...
Spatial Transformer Networks (STNs) have the potential to dramatically improve performance of convol...
We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent tr...
Vision transformers have shown excellent performance in computer vision tasks. As the computation co...
The successful application of ConvNets and other neural architectures to computer vision is central ...
Vision Transformers have shown great promise recently for many vision tasks due to the insightful ar...
International audienceIn this paper, we question if self-supervised learning provides new properties...
Vision transformers (ViT) have demonstrated impressive performance across numerous machine vision ta...
Visual representation learning is the key of solving various vision problems. Relying on the seminal...
In recent years, convolutional networks have dramatically (re)emerged as the dominant paradigm for s...
Image representation is a key component in visual recognition systems. In visual recognition problem...
Transformers have recently shown superior performances on various vision tasks. The large, sometimes...
Vision Transformer (ViT) attains state-of-the-art performance in visual recognition, and the variant...
Transformers have recently gained significant attention in the computer vision community. However, t...
While convolutional neural networks have shown a tremendous impact on various computer vision tasks,...
Transformers have become one of the dominant architectures in deep learning, particularly as a power...
Spatial Transformer Networks (STNs) have the potential to dramatically improve performance of convol...
We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent tr...