Recent advances in self-supervised learning, combined with the Transformer architecture, have enabled natural language processing (NLP) models to achieve extremely low perplexity. Such powerful models demand ever-increasing model sizes and, thus, large amounts of computation and memory. In this paper, we propose an efficient inference framework for large-scale generative language models. As the key to reducing model size, we quantize weights with a non-uniform quantization method. Quantized matrix multiplications are then accelerated by our proposed kernel, called nuQmm, which allows a wide trade-off between compression ratio and accuracy. Our proposed nuQmm reduces the latency of not only each GPU but also the entire inference of large LM...
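For concreteness, below is a minimal CPU-only NumPy sketch of lookup-table-based non-uniform weight quantization and the corresponding dequantize-then-multiply step. The function names, the k-means-style codebook fitting, and the 3-bit setting are illustrative assumptions, not the paper's actual algorithm; nuQmm itself is a dedicated GPU kernel for the quantized matrix multiplications described in the abstract.

```python
import numpy as np

def quantize_nonuniform(w, n_bits=3, n_iters=20):
    # Fit a per-matrix codebook of 2**n_bits non-uniform levels with 1-D k-means,
    # then store each weight as the index of its nearest level.
    flat = w.reshape(-1)
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, 2 ** n_bits))
    for _ in range(n_iters):
        codes = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(centroids.size):
            members = flat[codes == k]
            if members.size:
                centroids[k] = members.mean()
    codes = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return codes.reshape(w.shape).astype(np.uint8), centroids.astype(np.float32)

def lut_matmul(x, codes, codebook):
    # Dequantize by table lookup, then multiply; a fused GPU kernel would avoid
    # materializing the full-precision weight matrix like this sketch does.
    w_hat = codebook[codes]
    return x @ w_hat.T

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128)).astype(np.float32)  # weights of one linear layer
x = rng.standard_normal((4, 128)).astype(np.float32)    # a small batch of activations
codes, codebook = quantize_nonuniform(w, n_bits=3)       # 3-bit codes + 8-entry codebook
y = lut_matmul(x, codes, codebook)                       # (4, 256) output
err = np.abs(w - codebook[codes]).mean()                 # average quantization error
```

The non-uniform levels let the codebook follow the bell-shaped distribution of the weights, which is generally why a small number of bits can retain accuracy better than uniform grids at the same bit width.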
Graph Neural Network (GNN) training and inference involve significant challenges of scalability with...
Large language models (LLMs) based on transformers have made significant strides in recent years, th...
Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translati...
Scaling language models with more data, compute and parameters has driven significant progress in na...
The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the deman...
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quant...
Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race to...
Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion...
LLMs, or Large Language Models, are machine learning models used to understand and genera...
As the training of giant dense models reaches the limits of the availability and capability of the ha...
With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in...
Large language models (LLMs) exhibit excellent performance across a variety of tasks, but they come w...
There is growing interest in adapting large-scale language models using parameter-efficient fine-t...
Existing large language models have to run K times to generate a sequence of K tokens. In this paper...
Limited computational budgets often prevent transformers from being used in production and from havi...