Vision transformers have achieved significant improvements on various vision tasks, but the quadratic cost of interactions between tokens greatly reduces their computational efficiency. Recently, many pruning methods have been proposed that remove redundant tokens to build efficient vision transformers. However, existing studies mainly focus on token importance, preserving locally attentive tokens while completely ignoring global token diversity. In this paper, we emphasize the importance of diverse global semantics and propose an efficient token decoupling and merging method that jointly considers token importance and diversity for token pruning. Based on the class token attention, we decouple the attentive and inattentive tokens. In addition...
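As a rough illustration of the decoupling step described above, the sketch below splits patch tokens into attentive and inattentive sets using the class-token attention and fuses the inattentive ones into a single token. It is a minimal sketch under assumed shapes and names (`decouple_and_merge`, `keep_ratio`, attention-weighted merging are illustrative choices), not the paper's actual implementation.

import torch

def decouple_and_merge(tokens, cls_attn, keep_ratio=0.5):
    """Split patch tokens into attentive / inattentive sets using the
    class-token attention, then merge the inattentive ones.

    tokens:   (B, N, D) patch tokens (class token excluded)
    cls_attn: (B, N)    attention of the class token to each patch token
    Returns:  (B, k+1, D) attentive tokens plus one merged token.
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))

    # Rank tokens by class-token attention: the top-k are "attentive".
    idx = cls_attn.argsort(dim=1, descending=True)
    attentive_idx, inattentive_idx = idx[:, :k], idx[:, k:]

    gather = lambda ind: torch.gather(
        tokens, 1, ind.unsqueeze(-1).expand(-1, -1, D))
    attentive = gather(attentive_idx)        # (B, k, D)
    inattentive = gather(inattentive_idx)    # (B, N-k, D)

    # Merge inattentive tokens into a single token, weighted by their
    # class-token attention so more relevant tokens contribute more.
    w = torch.gather(cls_attn, 1, inattentive_idx)
    w = (w / w.sum(dim=1, keepdim=True).clamp_min(1e-6)).unsqueeze(-1)
    merged = (inattentive * w).sum(dim=1, keepdim=True)  # (B, 1, D)

    return torch.cat([attentive, merged], dim=1)

Merging rather than discarding the inattentive tokens is what lets such a scheme retain some of the global semantics that pure importance-based pruning throws away; how the merged tokens are formed and how diversity is measured is where specific methods differ.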
Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical succes...
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of c...
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural ...
Despite the success of vision transformers (ViTs), they still suffer from significant drops in accur...
Despite the recent success in many applications, the high computational requirements of vision trans...
Vision transformers (ViTs) have become the popular structures and outperformed convolutional neural ...
Vision transformers have achieved leading performance on various visual tasks yet still suffer from ...
While state-of-the-art vision transformer models achieve promising results in image classification, ...
Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exp...
Recently, Vision Transformer (ViT) has continuously established new milestones in the computer visio...
Many adaptations of transformers have emerged to address the single-modal vision tasks, where self-a...
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in...
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive ...
Recently, the vision transformer and its variants have played an increasingly important role in both...
In this paper, we introduce a set of effective TOken REduction (TORE) strategies for Transformer-bas...