Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory usage and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods either fail to maintain accuracy or cannot run efficiently on hardware. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution that enables 8-bit weight, 8-bit activation (W8A8) quantization for LLMs and can be implemented efficiently. We observe that systematic outliers appear in fixed activation channels. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation.
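To make the "difficulty migration" concrete, the following is a minimal PyTorch-style sketch of per-channel smoothing for a single linear layer, written under stated assumptions: the function name `smooth_linear_weights`, the calibration tensor `act_max`, and the use of `alpha = 0.5` as a balanced default are illustrative choices, not the paper's reference implementation.

```python
import torch

def smooth_linear_weights(W: torch.Tensor, act_max: torch.Tensor, alpha: float = 0.5):
    """Hypothetical sketch of SmoothQuant-style smoothing for one linear layer.

    W:       layer weight, shape (out_features, in_features)
    act_max: per-input-channel max |activation| collected offline on
             calibration data, shape (in_features,)
    alpha:   migration strength; larger values move more quantization
             difficulty from activations onto the weights
    """
    w_max = W.abs().amax(dim=0).clamp(min=1e-5)            # per-input-channel weight range
    s = act_max.clamp(min=1e-5).pow(alpha) / w_max.pow(1.0 - alpha)
    W_smoothed = W * s                                      # scale each weight column up by s_j
    # At inference the incoming activations are divided by s (a division that can
    # be folded into the preceding LayerNorm or linear layer), so the layer output
    # is mathematically unchanged:  (X / s) @ (W * s).T == X @ W.T
    return s, W_smoothed
```

The intended effect is that both the scaled activations X / s and the scaled weights W * s have flatter per-channel ranges than the originals, so standard INT8 quantization can be applied to each side with less clipping and rounding error.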
LLMs, or Large Language Models, are machine learning models used to understand and genera...
Quantization emerges as one of the most promising approaches for deploying advanced deep models on r...
The growing size of generative Pre-trained Language Models (PLMs) has greatly increased the deman...
Large language models (LLMs) exhibit excellent performance across a variety of tasks, but they come w...
With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in...
The recent advance of self-supervised learning, combined with the Transformer architecture, enables ...
Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race to...
Post-training quantization (PTQ) is the go-to compression technique for large generative models, suc...
Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for i...
Quantization has become a predominant approach for model compression, enabling deployment of large m...
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fin...
There is growing interest in adapting large-scale language models using parameter-efficient fine-t...
We propose a novel 2-stage sub-8-bit quantization-aware training algorithm for all components of a 2...
Large language models (LLMs) with hundreds of billions or trillions of parameters, represented by ch...
While neural networks have been remarkably successful in a wide array of applications, implementing ...