Transformer-based architectures have become the de facto models for a range of Natural Language Processing tasks. In particular, BERT-based models achieved significant accuracy gains on GLUE tasks, CoNLL-03, and SQuAD. However, BERT-based models have a prohibitive memory footprint and latency. As a result, deploying BERT-based models in resource-constrained environments has become challenging. In this work, we perform an extensive analysis of fine-tuned BERT models using second-order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra-low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian-based mixed-precision method to compress the model further.
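To make the group-wise idea concrete, the sketch below quantizes a weight matrix in contiguous groups, each with its own scale, using plain symmetric uniform quantization. This is a minimal NumPy illustration under our own assumptions (the function name, the 4-bit default, and group_size=128 are hypothetical choices), not the paper's implementation, which additionally uses the Hessian analysis to assign mixed per-layer bit widths.

```python
# Minimal sketch of group-wise uniform (symmetric) quantization.
# All names, the 4-bit default, and group_size=128 are illustrative
# assumptions, not the Q-BERT implementation.
import numpy as np

def quantize_groupwise(weights, num_bits=4, group_size=128):
    """Quantize `weights` in contiguous groups, each with its own scale,
    and return the dequantized (reconstructed) values for inspection."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for a 4-bit symmetric range
    flat = weights.astype(np.float32).reshape(-1)
    out = np.empty_like(flat)
    for start in range(0, flat.size, group_size):
        group = flat[start:start + group_size]
        scale = np.abs(group).max() / qmax    # per-group scaling factor
        if scale == 0.0:                      # all-zero group: avoid divide-by-zero
            scale = 1.0
        q = np.clip(np.round(group / scale), -qmax, qmax)
        out[start:start + group_size] = q * scale
    return out.reshape(weights.shape)

# Example: quantize a BERT-sized projection matrix and check the error.
w = np.random.randn(768, 768).astype(np.float32)
w_q = quantize_groupwise(w, num_bits=4, group_size=128)
print("mean abs reconstruction error:", float(np.abs(w - w_q).mean()))
```

The intent of the finer group granularity is to give each group of weights its own dynamic range, so that a few outlier values in one group do not force a coarse quantization scale on all the others.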
Can we utilize extremely large monolingual text to improve neural machine translation without the ex...
Neural networks are powerful tools for decision making and for solving complex problems in r...
Fine-tuning pre-trained language models (PTLMs), such as BERT and its better variant RoBERTa, has be...
Transformer-based language models have become a key building block for natural language processing. ...
Pre-trained language models of the BERT family have defined the state of the art in a wide range of...
Large pre-trained language models have recently gained significant traction due to their improved pe...
This model is fine-tuned and quantized based on a pre-trained huggingface BERT model. The quantizatio...
In this position statement, we wish to contribute to the discussion about how to assess quality and ...
This model is fine-tuned based on MLPerf Inference BERT PyTorch Model on SQuAD v1.1 dataset and conv...
Currently, the most widespread neural network architecture for training language models is the so-ca...
The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the deman...
As language models have grown in parameters and layers, it has become much harder to train and infer...
Pre-training complex language models is essential for the success of the recent methods such as BERT...
Transformer models perform well on Natural Language Processing and Natural Language Understanding ta...