Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks. However, such powerful transformers incur a heavy computation burden due to exhaustive token-to-token comparison. Previous works focus on dropping insignificant tokens to reduce the computational cost of ViTs, but as the dropping ratio increases, this hard manner inevitably discards vital tokens, which limits its efficiency. To address this issue, we propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT. Specifically, we first design a novel Token Slimming Module (TSM), which boosts the inference efficiency of ViTs by dynamic token aggregation. As a general...
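The abstract above describes dynamic token aggregation: instead of hard-dropping tokens, the N input tokens are softly merged into M < N output tokens. Below is a minimal PyTorch sketch of that idea under stated assumptions; the class name, the small scoring MLP, and the 196-to-49 token counts are illustrative choices, not the paper's exact TSM implementation.

import torch
import torch.nn as nn

class TokenSlimmingModule(nn.Module):
    """Aggregate N input tokens into M < N output tokens using a
    dynamically predicted, softmax-normalized aggregation matrix
    (a sketch of the token-slimming idea, not the official code)."""

    def __init__(self, dim: int, num_out_tokens: int, hidden_ratio: float = 0.5):
        super().__init__()
        hidden = int(dim * hidden_ratio)
        # Small MLP that scores how much each input token contributes
        # to each of the M slimmed output tokens.
        self.scorer = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_out_tokens),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) input tokens
        scores = self.scorer(x)                # (B, N, M) contribution scores
        weights = scores.softmax(dim=1)        # normalize over the N input tokens
        slimmed = weights.transpose(1, 2) @ x  # (B, M, C) aggregated tokens
        return slimmed

# Usage: slim 196 patch tokens down to 49 before the later blocks.
tsm = TokenSlimmingModule(dim=384, num_out_tokens=49)
tokens = torch.randn(2, 196, 384)
print(tsm(tokens).shape)  # torch.Size([2, 49, 384])

Because every output token is a weighted combination of all input tokens rather than a hard selection, information from "dropped" regions can still survive slimming, which is the contrast the abstract draws with token-dropping methods.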
Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for ...
Vision transformers have achieved significant improvements on various vision tasks but their quadrat...
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. Howev...
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive ...
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in...
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural ...
We attempt to reduce the computational costs in vision transformers (ViTs), which increase quadratic...
Despite the success of vision transformers (ViTs), they still suffer from significant drops in accur...
Recently, Vision Transformer (ViT) has continuously established new milestones in the computer visio...
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their ...
Despite the recent success in many applications, the high computational requirements of vision trans...
Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical succes...
Vision Transformer (ViT) and its variants (e.g., Swin, PVT) have achieved great success in various c...
The vanilla self-attention mechanism inherently relies on pre-defined and steadfast computational di...