Sparsely activated transformers, such as Mixture-of-Experts (MoE) models, have received great interest due to their promising scaling capability, which enables dramatic increases in model size without a commensurate increase in computational cost. To achieve this, MoE models replace the feed-forward sub-layer of the transformer with a Mixture-of-Experts sub-layer and use a gating network to route each token to its assigned experts. Since the common practice for efficient training of such models is to distribute experts and tokens across different machines, this routing strategy often incurs a substantial cross-machine communication cost, because a token and its assigned experts are likely to reside on different machines. In this paper, we propose \emph{Gating Drop...
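As a rough illustration of the gating-and-routing mechanism described above, the following PyTorch sketch implements a generic top-1 (Switch-style) MoE sub-layer. It is not the paper's implementation; the class and parameter names (e.g. `SimpleMoELayer`, `num_experts`) are illustrative assumptions, and in a real distributed setup each expert would live on a different device, which is where the cross-machine communication cost arises.

```python
# Minimal sketch of a top-1 gated Mixture-of-Experts sub-layer (illustrative only).
import torch
import torch.nn as nn

class SimpleMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        # Gating network: produces one routing score per expert for each token.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = torch.softmax(self.gate(x), dim=-1)   # routing probabilities
        expert_idx = scores.argmax(dim=-1)             # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                     # tokens routed to expert e
            if mask.any():
                # Scale by the gate probability so the router receives gradients.
                out[mask] = expert(x[mask]) * scores[mask, e].unsqueeze(-1)
        return out
```

In a distributed training run, the per-expert dispatch (`x[mask]`) and the gather of expert outputs are replaced by all-to-all communication across workers, since each expert is sharded to a different machine.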