We find that, at sequence length 512, padding tokens represent in excess of 50% of the Wikipedia dataset used for pretraining BERT (Bidirectional Encoder Representations from Transformers). Therefore, by removing all padding we achieve a 2x speed-up in terms of sequences/sec. To exploit this characteristic of the dataset, we develop and contrast two deterministic packing algorithms. Both algorithms rely on the assumption that sequences are interchangeable, and therefore packing can be performed on the histogram of sequence lengths rather than per sample. This transformation of the problem leads to algorithms that are fast and have linear complexity in dataset size. The shortest-pack-first histogram-packing (SPFHP) algorithm determines the pac...
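The abstract above only names the idea of packing on a histogram of sequence lengths; the short Python sketch below illustrates a greedy "shortest pack first" strategy in that spirit. It is an assumption-laden illustration, not the authors' SPFHP implementation: the function name pack_histogram, the per-sequence loop, and the 512-token limit are ours for demonstration, and the published algorithm imposes additional constraints (such as a bound on how many sequences share a pack).

def pack_histogram(histogram, max_len=512):
    # Greedy "shortest pack first" packing over a histogram of sequence lengths.
    # histogram: dict mapping sequence length -> number of sequences of that length.
    # Returns a list of packs; each pack is a list of lengths summing to <= max_len.
    # Sketch only; not the SPFHP algorithm as published.
    open_packs = []  # each entry: [used_length, list_of_lengths]
    for length in sorted(histogram, reverse=True):  # place longest sequences first
        for _ in range(histogram[length]):
            # Prefer the least-filled open pack that can still take this sequence.
            candidates = [p for p in open_packs if p[0] + length <= max_len]
            if candidates:
                best = min(candidates, key=lambda p: p[0])
                best[0] += length
                best[1].append(length)
            else:
                open_packs.append([length, [length]])
    return [lengths for _, lengths in open_packs]

# Example: 14 sequences of mixed lengths end up in 6 packed rows of at most
# 512 tokens instead of 14 padded rows.
example = {512: 2, 384: 2, 128: 4, 60: 6}
print(pack_histogram(example, max_len=512))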
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many pri...
Large Language Models have become the core architecture upon which most modern natural language proc...
Recent advances with large-scale pre-trained language models (e.g., BERT) have brought significant p...
Transformer-based language models have become a key building block for natural language processing. ...
Pre-trained language models of the BERT family have defined the state of the art in a wide range of...
Recently, the development of pre-trained language models has brought natural language processing (NL...
Neural networks are powerful tools for decision making and for solving complex problems in r...
When processing a batch of graphs in machine learning models such as Graph Neural Networks (GNN), it...
The availability of large and rich quantities of text data is due to the emergence of the World Wide...
We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of l...
Since the first bidirectional deep learning model for natural language understanding, BERT, emerge...
As the demand for sophisticated Natural Language Processing (NLP) models continues to grow, so does ...
GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various n...
In the natural language processing (NLP) literature, neural networks are becoming increasingly deepe...