Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized. We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations be...
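The core idea in the abstract above — quantizing both weights and activations so the matrix multiply itself runs in integer arithmetic — can be illustrated with a minimal sketch. This is not the paper's actual method (which targets lower bit-widths and per-group scales); it is a generic symmetric per-tensor quantization example, with all function names chosen here for illustration:

```python
import numpy as np

def quantize_sym(x, bits=8):
    # Symmetric per-tensor quantization: map floats to signed ints
    # using a single scale derived from the tensor's max magnitude.
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(x))
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quantized_matmul(w, a, bits=8):
    # Quantize weights AND activations, accumulate in int32
    # (the compute-bound part runs on integers), then rescale
    # the accumulator back to float with the product of scales.
    qw, sw = quantize_sym(w, bits)
    qa, sa = quantize_sym(a, bits)
    acc = qw.astype(np.int32) @ qa.astype(np.int32)
    return acc.astype(np.float32) * (sw * sa)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)
A = rng.standard_normal((8, 3)).astype(np.float32)
err = np.abs(quantized_matmul(W, A) - W @ A).max()
```

Because both operands are integers, the inner product can use fast integer tensor cores; weight-only schemes instead dequantize weights back to floats before the multiply, which saves memory bandwidth but not compute.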
As the training of giant dense models hits the boundary on the availability and capability of the ha...
Large language models (LLMs) based on transformers have made significant strides in recent years, th...
Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these model...
The recent advance of self-supervised learning associated with the Transformer architecture enables ...
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quant...
The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the deman...
Large language models (LLMs) exhibit excellent performance across a variety of tasks, but they come w...
With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in...
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown e...
Scaling language models with more data, compute and parameters has driven significant progress in na...
There is growing interest in adapting large-scale language models using parameter-efficient fine-t...
LLMs, or Large Language Models, are machine learning models used to understand and genera...
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fin...
Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion...
Existing large language models have to run K times to generate a sequence of K tokens. In this paper...