Transformer models have shown promising effectiveness on various vision tasks. However, compared with training Convolutional Neural Network (CNN) models, training Vision Transformer (ViT) models is more difficult and relies on large-scale training sets. To explain this observation, we hypothesize that \textit{ViT models are less effective than CNN models at capturing the high-frequency components of images}, and verify it with a frequency analysis. Inspired by this finding, we first investigate the effects of existing techniques for improving ViT models from a new frequency perspective, and find that the success of some techniques (e.g., RandAugment) can be attributed to the better usage of the high-frequency componen...
In the past few years, transformers have achieved promising performance on various computer vision ...
Recent advances in Vision Transformer (ViT) have demonstrated its impressive performance in image cl...
Current research indicates that inductive bias (IB) can improve Vision Transformer (ViT) performanc...
Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of...
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range ...
Transformer design is the de facto standard for natural language processing tasks. The success of th...
The vision transformer (ViT) has advanced to the cutting edge in the visual recognition task. Transf...
Vision transformers (ViT) have demonstrated impressive performance across numerous machine vision ta...
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive ...
Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tack...
Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs)...
Although transformer networks have recently been employed in various vision tasks with outperforming perfo...
We attempt to reduce the computational costs in vision transformers (ViTs), which increase quadratic...
Vision Transformer (ViT) demonstrates that the Transformer for natural language processing can be applie...
Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tas...