Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections, and their quadratic time and memory cost, may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position against random connections, and measure which patterns reduce performance the least. We find that, on three common fine-tuning tasks, even attention that is at least 78% sparse can have little effect on performance when applied at later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for th...
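The ablation setup described above amounts to replacing full attention with a fixed boolean mask in a subset of layers. The sketch below is illustrative only: the position-window pattern, the layer cutoff, and the tensor sizes are assumptions chosen to mirror the idea of sparsifying only the later layers, not the exact configuration used in the experiments.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, allow_mask=None):
    """Scaled dot-product attention with an optional boolean 'allowed pairs' mask."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if allow_mask is not None:
        # Disallowed pairs get -inf before softmax, so they receive zero weight.
        scores = scores.masked_fill(~allow_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def position_window_mask(seq_len, window=2):
    """Position-based sparsity pattern: each token attends only to a local window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

batch, heads, seq_len, head_dim = 1, 4, 32, 16
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

mask = position_window_mask(seq_len, window=2)
print(f"mask sparsity: {1.0 - mask.float().mean().item():.0%}")

num_layers, sparse_from = 12, 8          # dense early layers, sparse later layers
for layer in range(num_layers):
    layer_mask = mask if layer >= sparse_from else None
    out = masked_attention(q, k, v, allow_mask=layer_mask)
```

Masks built from syntax or lexical similarity would simply replace position_window_mask with a different boolean matrix; the application to the attention scores is identical.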
Although transformer networks have recently been employed in various vision tasks with outperforming perfo...
As language models have grown in parameters and layers, it has become much harder to train and infer...
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse ph...
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It c...
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-at...
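For context, Performers estimate the softmax attention kernel with positive random features: exp(q·k) equals the expectation of exp(w·q - ||q||²/2)·exp(w·k - ||k||²/2) over Gaussian w. The NumPy sketch below is a minimal illustration of that estimator; the dimensions and the number of random features are arbitrary choices, not values from the paper.

```python
import numpy as np

def positive_random_features(x, w):
    """phi(x)_j = exp(w_j . x - ||x||^2 / 2) / sqrt(m), so E[phi(q) . phi(k)] = exp(q . k)."""
    m = w.shape[0]
    proj = x @ w.T                                   # (n, m)
    norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(proj - norm) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d, m = 6, 16, 256
q = rng.normal(size=(n, d)) / d ** 0.25              # scaling so q.k behaves like q.k / sqrt(d)
k = rng.normal(size=(n, d)) / d ** 0.25
w = rng.normal(size=(m, d))                          # shared Gaussian projections

exact = np.exp(q @ k.T)                              # unnormalised softmax kernel
approx = positive_random_features(q, w) @ positive_random_features(k, w).T
print(np.abs(exact - approx).mean())                 # rough agreement; improves as m grows
```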
Probing neural models for the ability to perform downstream tasks using their activation patterns is...
To overcome the quadratic cost of self-attention, recent works have proposed various sparse attentio...
Fine-tuning pre-trained models has achieved impressive performance on standard natural language pro...
Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DN...
Model sparsification in deep learning promotes simpler, more interpretable models with fewer paramet...
Attention-based architectures have become ubiquitous in machine learning, yet our understanding of t...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
Attention networks such as transformers have been shown to be powerful in many applications ranging from n...
We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset ...
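The BitFit recipe can be expressed in a few lines: freeze everything except parameters whose names contain "bias" (plus, in practice, the small task-specific head). The sketch below assumes a Hugging Face BERT checkpoint purely for illustration; the head name ("classifier") varies by architecture.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Illustrative checkpoint; any encoder with bias terms works the same way.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# BitFit-style freezing: only bias terms (and the task head) stay trainable.
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total} ({100 * trainable / total:.2f}%)")

# The optimizer then only sees the unfrozen parameters.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```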
The training and generalization dynamics of the Transformer's core mechanism, namely the Attention m...