Mixup-based augmentation has been found effective for improving model generalization during training, especially for Vision Transformers (ViTs), which overfit easily. However, previous mixup-based methods rely on an implicit prior: the ratio used to linearly interpolate the targets should equal the ratio used to interpolate the inputs. This can produce a strange phenomenon in which, due to the randomness of the augmentation, the mixed image contains no valid object for a class, yet the label space still assigns a response to it. To bridge this gap between the input and label spaces, we propose TransMix, which mixes labels based on the attention maps of Vision Transformers: the confidence of a label is larger if the corresponding inputs are weighted more heavily by the attention map.
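The abstract above describes the core of TransMix: re-deriving the label mixing ratio from the model's attention rather than from the geometric mixing ratio. Below is a minimal sketch of that idea; the function names, tensor shapes, and the use of the [CLS]-token attention from the last block are assumptions for illustration, not the paper's reference implementation.

```python
import torch

def transmix_lambda(attn_cls, box_mask):
    """Re-weight the CutMix label-mixing ratio with attention (TransMix-style sketch).

    attn_cls: (B, N) attention from the [CLS] token to the N patch tokens,
              averaged over heads, taken from the last transformer block (assumed).
    box_mask: (B, N) binary mask, 1 where a patch token comes from the pasted image.
    """
    attn_cls = attn_cls / attn_cls.sum(dim=1, keepdim=True)  # normalize per image
    # lambda = total attention mass falling on the pasted region
    lam = (attn_cls * box_mask).sum(dim=1)  # (B,)
    return lam

def mix_targets(y_a, y_b, lam):
    """Linearly interpolate one-hot targets (B, C) with the attention-derived ratio."""
    lam = lam.unsqueeze(1)  # (B, 1), broadcast over classes
    return lam * y_b + (1.0 - lam) * y_a
```

With this, an image pair whose pasted region attracts little attention (e.g., pure background) contributes little to the mixed label, which is exactly the input/label-space mismatch the abstract points out.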
Vision Transformers are becoming more and more the preferred solution to many computer vision proble...
Transformers have achieved great success in natural language processing. Due to the powerful capabil...
Recent advances on Vision Transformer (ViT) and its improved variants have shown that self-attention...
CutMix is a vital augmentation strategy that determines the performance and generalization ability o...
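Since several of the papers listed here build on CutMix, a brief sketch of the standard CutMix operation may help as a reference point. The Beta(α, α) draw and area-based label mixing follow the widely known CutMix recipe; the helper names and the one-hot-target assumption are illustrative.

```python
import numpy as np
import torch

def rand_bbox(h, w, lam):
    """Sample a box whose area fraction is roughly (1 - lam)."""
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    return y1, y2, x1, x2

def cutmix(images, targets, alpha=1.0):
    """Paste a random box from a shuffled batch; mix one-hot targets by pixel area.

    images:  (B, C, H, W) tensor, modified in place for brevity.
    targets: (B, num_classes) one-hot float tensor.
    """
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    h, w = images.shape[-2:]
    y1, y2, x1, x2 = rand_bbox(h, w, lam)
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # recompute lam from the actual (clipped) box area
    lam = 1.0 - ((y2 - y1) * (x2 - x1)) / (h * w)
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return images, mixed_targets
```

Note that the label ratio here is purely geometric (pixel area), which is precisely the prior that TransMix replaces with an attention-derived ratio.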
Transformers with powerful global relation modeling abilities have been introduced to fundamental co...
Vision transformers (ViT) have demonstrated impressive performance across numerous machine vision ta...
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of c...
We show that Vision-Language Transformers can be learned without human labels (e.g. class labels, bo...
In this study, we propose Mixed and Masked Image Modeling (MixMIM), a simple but efficient MIM metho...
Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corrupt...
Recently, MLP-like vision models have achieved promising performances on mainstream visual recogniti...
Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-language pr...
Transformers have become one of the dominant architectures in deep learning, particularly as a power...
Ultra-fine-grained visual categorization (ultra-FGVC) moves down the taxonomy level to classify sub-...
Transformers were initially introduced for natural language processing (NLP) tasks, but fas...