Transformer models are widely used in AI applications such as Natural Language Processing (NLP), Computer Vision (CV), etc. However, the enormous computation workload becomes an obstacle to training large transformer models efficiently. Recently, some methods have focused on reducing the computation workload during training by skipping some layers. However, these methods rely on simple probability distributions and coarse-grained probability calculations, which significantly degrade model accuracy. To address this issue, in this paper we propose a novel method to accelerate training, Sensitivity-Based Layer Dropping (SBLD). SBLD uses layer-wise sensitivity data to switch transformer layers on and off in the proper order to maintain high accuracy. Besides, we adjus...
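As a rough illustration of the idea described in this abstract, the sketch below drops whole transformer layers stochastically during training, with each layer's keep-probability derived from a per-layer sensitivity score. The class name, the min_keep floor, and the min-max mapping from sensitivity to keep-probability are illustrative assumptions for this sketch, not the exact SBLD scheme.

import torch
import torch.nn as nn

class SensitivityLayerDrop(nn.Module):
    """Transformer encoder stack that stochastically skips layers during
    training. Each layer's keep-probability is derived from a per-layer
    sensitivity score supplied by the caller; less sensitive layers are
    dropped more often, reducing compute per step.
    Note: the sensitivity-to-probability mapping below is an illustrative
    assumption, not the exact rule from the SBLD paper."""

    def __init__(self, d_model, n_heads, n_layers, sensitivities, min_keep=0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # Normalize sensitivities to [min_keep, 1.0]: the most sensitive
        # layer is always kept, the least sensitive is kept with prob min_keep.
        s = torch.as_tensor(sensitivities, dtype=torch.float32)
        s = (s - s.min()) / (s.max() - s.min() + 1e-8)
        self.keep_probs = min_keep + (1.0 - min_keep) * s

    def forward(self, x):
        for layer, p in zip(self.layers, self.keep_probs):
            if self.training and torch.rand(()) > p:
                continue  # skip this layer for the current training step
            x = layer(x)
        return x

# Usage: 6 layers with hypothetical sensitivity scores.
model = SensitivityLayerDrop(d_model=64, n_heads=4, n_layers=6,
                             sensitivities=[0.9, 0.7, 0.4, 0.3, 0.5, 0.95])
out = model(torch.randn(2, 10, 64))  # (batch, seq_len, d_model)

At evaluation time no layers are skipped, so the full stack is always used for inference; only the training cost per step is reduced.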
The great success of transformer-based models in natural language processing (NLP) has led to variou...
There has been an explosion of interest in designing high-performance Transformers. While Transforme...
Recently, large-scale transformer-based models have been proven to be effective over various tasks a...
Large-scale transformer models have become the de-facto architectures for various machine learning a...
Teams that have trained large Transformer-based models have reported training instabilities at large...
Transformer-based neural models are used in many AI applications. Training these models is expensive...
Training large transformer models is one of the most important computational challenges of modern AI...
The computation necessary for training Transformer-based language models has skyrocketed in recent y...
We revisit the design choices in Transformers, and propose methods to address their weaknesses in ha...
Solid results from Transformers have made them prevailing architectures in various natural language ...
Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior ...
Sparsely activated transformers, such as Mixture of Experts (MoE), have received great interest due ...
Transformer-based models are used to achieve state-of-the-art performance on various deep learning t...
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexi...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...