Sparsely activated transformers, such as Mixture-of-Experts (MoE) models, have received great interest due to their promising scaling capability, which enables dramatic increases in model size without a commensurate increase in computational cost. To achieve this, MoE models replace the feed-forward sub-layer of the transformer with a Mixture-of-Experts sub-layer and use a gating network to route each token to its assigned experts. Since the common practice for efficient training of such models is to distribute experts and tokens across different machines, this routing strategy often incurs a substantial cross-machine communication cost, because a token and its assigned experts are likely to reside on different machines. In this paper, we propose \emph{Gating Drop...
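As a rough illustration of the gating-and-routing mechanism described above, the following PyTorch sketch implements a generic top-1 (Switch-style) MoE sub-layer. It is not the paper's implementation; the class and parameter names (e.g. `SimpleMoELayer`, `num_experts`) are illustrative assumptions, and in a real distributed setup each expert would live on a different device, which is where the cross-machine communication cost arises.

```python
# Minimal sketch of a top-1 gated Mixture-of-Experts sub-layer (illustrative only).
import torch
import torch.nn as nn

class SimpleMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        # Gating network: produces one routing score per expert for each token.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = torch.softmax(self.gate(x), dim=-1)   # routing probabilities
        expert_idx = scores.argmax(dim=-1)             # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                     # tokens routed to expert e
            if mask.any():
                # Scale by the gate probability so the router receives gradients.
                out[mask] = expert(x[mask]) * scores[mask, e].unsqueeze(-1)
        return out
```

In a distributed training run, the per-expert dispatch (`x[mask]`) and the gather of expert outputs are replaced by all-to-all communication across workers, since each expert is sharded to a different machine.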