Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections, and their quadratic time and memory cost, may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position against random connections, and measure which patterns reduce performance the least. We find that, on three common fine-tuning tasks, even attention that is at least 78% sparse can have little effect on performance when applied at later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for th...
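The ablation setup described above amounts to replacing full attention with a fixed boolean mask in a subset of layers. The sketch below is illustrative only: the position-window pattern, the layer cutoff, and the tensor sizes are assumptions chosen to mirror the idea of sparsifying only the later layers, not the exact configuration used in the experiments.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, allow_mask=None):
    """Scaled dot-product attention with an optional boolean 'allowed pairs' mask."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if allow_mask is not None:
        # Disallowed pairs get -inf before softmax, so they receive zero weight.
        scores = scores.masked_fill(~allow_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def position_window_mask(seq_len, window=2):
    """Position-based sparsity pattern: each token attends only to a local window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

batch, heads, seq_len, head_dim = 1, 4, 32, 16
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

mask = position_window_mask(seq_len, window=2)
print(f"mask sparsity: {1.0 - mask.float().mean().item():.0%}")

num_layers, sparse_from = 12, 8          # dense early layers, sparse later layers
for layer in range(num_layers):
    layer_mask = mask if layer >= sparse_from else None
    out = masked_attention(q, k, v, allow_mask=layer_mask)
```

Masks built from syntax or lexical similarity would simply replace position_window_mask with a different boolean matrix; the application to the attention scores is identical.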
Although transformer networks have recently been employed in various vision tasks with outperforming perfo...
As language models have grown in parameters and layers, it has become much harder to train and infer...
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse ph...
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It c...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
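For context, Performers estimate the softmax attention kernel with positive random features: exp(q·k) equals the expectation of exp(w·q - ||q||²/2)·exp(w·k - ||k||²/2) over Gaussian w. The NumPy sketch below is a minimal illustration of that estimator; the dimensions and the number of random features are arbitrary choices, not values from the paper.

```python
import numpy as np

def positive_random_features(x, w):
    """phi(x)_j = exp(w_j . x - ||x||^2 / 2) / sqrt(m), so E[phi(q) . phi(k)] = exp(q . k)."""
    m = w.shape[0]
    proj = x @ w.T                                   # (n, m)
    norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(proj - norm) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d, m = 6, 16, 256
q = rng.normal(size=(n, d)) / d ** 0.25              # scaling so q.k behaves like q.k / sqrt(d)
k = rng.normal(size=(n, d)) / d ** 0.25
w = rng.normal(size=(m, d))                          # shared Gaussian projections

exact = np.exp(q @ k.T)                              # unnormalised softmax kernel
approx = positive_random_features(q, w) @ positive_random_features(k, w).T
print(np.abs(exact - approx).mean())                 # rough agreement; improves as m grows
```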
Probing neural models for the ability to perform downstream tasks using their activation patterns is...
To overcome the quadratic cost of self-attention, recent works have proposed various sparse attentio...
Fine-tuning pre-trained models has achieved impressive performance on standard natural language pro...
Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DN...
Model sparsification in deep learning promotes simpler, more interpretable models with fewer paramet...
Attention-based architectures have become ubiquitous in machine learning, yet our understanding of t...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
Attention networks such as transformers have been shown to be powerful in many applications ranging from n...
We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset ...
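The BitFit recipe can be expressed in a few lines: freeze everything except parameters whose names contain "bias" (plus, in practice, the small task-specific head). The sketch below assumes a Hugging Face BERT checkpoint purely for illustration; the head name ("classifier") varies by architecture.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Illustrative checkpoint; any encoder with bias terms works the same way.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# BitFit-style freezing: only bias terms (and the task head) stay trainable.
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total} ({100 * trainable / total:.2f}%)")

# The optimizer then only sees the unfrozen parameters.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```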
The training and generalization dynamics of the Transformer's core mechanism, namely the Attention m...