Large-scale transformer models have become the de facto architectures for various machine learning applications, e.g., computer vision (CV) and natural language processing (NLP). However, these large models also introduce prohibitive training costs. To mitigate this issue, we propose a novel random and layerwise token dropping method (random-LTD), which skips the computation of a subset of the input tokens at all middle layers. In particular, random-LTD achieves considerable speedups and comparable accuracy to the standard training baseline. Compared to other token dropping methods, random-LTD does not require (1) any importance-score-based metrics, (2) any special token treatment (e.g., [CLS]), or (3) training many layers on the full sequence length, apart from the first and the last layers. Bes...
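As a rough illustration of the random-LTD idea described in the abstract above (not the authors' implementation), the sketch below assumes `layers` is a list of callable transformer blocks mapping a `(batch, seq, dim)` tensor to a tensor of the same shape; the function name `random_ltd_forward` and the `keep_ratio` parameter are hypothetical. The first and last blocks see the full sequence, each middle block processes only a uniformly random subset of tokens, and the dropped tokens pass through that layer unchanged, so no importance scores or special-token handling are needed.

```python
import torch

def random_ltd_forward(layers, hidden_states, keep_ratio=0.5):
    """Minimal sketch of random, layerwise token dropping (random-LTD).

    First and last layers run on the full sequence; every middle layer
    computes only on a random subset of tokens, and the remaining tokens
    skip that layer unchanged.
    """
    batch, seq_len, _ = hidden_states.shape
    num_keep = max(1, int(seq_len * keep_ratio))

    # First layer: full sequence.
    hidden_states = layers[0](hidden_states)

    # Middle layers: random token subset, no importance-score metric.
    for layer in layers[1:-1]:
        keep_idx = torch.randperm(seq_len, device=hidden_states.device)[:num_keep]
        kept = hidden_states[:, keep_idx, :]   # gather the kept tokens
        kept = layer(kept)                     # compute only on the subset
        hidden_states = hidden_states.clone()  # avoid in-place autograd issues
        hidden_states[:, keep_idx, :] = kept   # scatter results back

    # Last layer: full sequence again.
    hidden_states = layers[-1](hidden_states)
    return hidden_states
```

With blocks such as `torch.nn.TransformerEncoderLayer(..., batch_first=True)`, setting `keep_ratio=0.5` roughly halves the attention and MLP work in every middle layer while the first and last layers still train on the full sequence.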
We introduce token-consistent stochastic layers in vision transformers, without causing any severe d...
We revisit the design choices in Transformers, and propose methods to address their weaknesses in ha...
Limited computational budgets often prevent transformers from being used in production and from havi...
Transformer models are widely used in AI applications such as Natural Language Processing (NLP), Com...
The computation necessary for training Transformer-based language models has skyrocketed in recent y...
Solid results from Transformers have made them prevailing architectures in various natural language ...
Sparsely activated transformers, such as Mixture of Experts (MoE), have received great interest due ...
Several recent works demonstrate that transformers can implement algorithms like gradient descent. B...
Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exp...
Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior ...
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive ...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
Self-supervised training methods for transformers have demonstrated remarkable performance across va...
Recently, the development of pre-trained language models has brought natural language processing (NL...
The objective of this paper is an efficient training method for video tasks. We make three contribut...