Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradie...
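The abstract above only sketches the mechanism, so here is a minimal illustration of how per-head importance variables with a hard budget on the number of unpruned heads could be realized; this is a sketch assuming a straight-through Gumbel top-k relaxation in PyTorch, and the names HeadGate, num_heads, k_heads, and tau are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch: learnable per-head importance logits with a hard top-k
# head budget, trained via a straight-through estimator. Assumed names and
# hyperparameters (HeadGate, num_heads, k_heads, tau) are hypothetical.
import torch
import torch.nn as nn


class HeadGate(nn.Module):
    """Learns per-head importance logits and keeps exactly `k_heads` heads."""

    def __init__(self, num_heads: int, k_heads: int, tau: float = 1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_heads))  # per-head importance
        self.k_heads = k_heads
        self.tau = tau

    def forward(self) -> torch.Tensor:
        if self.training:
            # Perturb logits with Gumbel noise so the top-k selection is stochastic.
            gumbel = -torch.log(-torch.log(torch.rand_like(self.logits)))
            scores = (self.logits + gumbel) / self.tau
        else:
            scores = self.logits
        # Hard top-k mask enforcing the user-specified head budget.
        topk = torch.topk(scores, self.k_heads).indices
        hard = torch.zeros_like(scores).scatter_(0, topk, 1.0)
        # Straight-through estimator: hard mask in the forward pass,
        # gradients flow through the soft relaxation in the backward pass.
        soft = torch.sigmoid(scores)
        return hard + soft - soft.detach()


gate = HeadGate(num_heads=12, k_heads=4)
mask = gate()  # shape (12,); exactly 4 entries are 1 in the forward pass
# Each head's output would be scaled by mask[h] before the output projection.
```

The straight-through trick keeps exactly k_heads heads active in the forward pass while still letting gradients reach every head's importance logit, which mirrors the hard-constraint behaviour the abstract describes.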
Sparsely activated transformers, such as Mixture of Experts (MoE), have received great interest due ...
In this paper, we study the computation of how much an input token in a Transformer model influences...
To overcome the quadratic cost of self-attention, recent works have proposed various sparse attentio...
Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
Transformers are the state-of-the-art for machine translation and grammar error correction. One of t...
Multi-headed attention heads are a mainstay in transformer-based models. Different methods have been...
Transformer-based models have brought a radical change to neural machine translation. A key feature ...
The attention mechanism is the key to many state-of-the-art transformer-based models in Natural Lang...
The use of Transformers outside the realm of natural language processing is becoming more and more p...
Multimodal Deep Learning has garnered much interest, and transformers have tri...
Many machine learning tasks such as multiple instance learning, 3D shape recognition, and few-shot i...
The transformer multi-head self-attention mechanism has been thoroughly investigated recently. On o...
Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exp...
Transformer models have achieved state-of-the-art results across a diverse range of domains. However...