Large language models (LLMs) exhibit excellent performance across a variety of tasks, but they come with significant computational and storage costs. Quantizing these models is an effective way to alleviate this issue. However, existing methods struggle to strike a balance between model accuracy and hardware efficiency. To this end, we introduce AWEQ, a post-training method that requires no additional training overhead. AWEQ excels in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization. We observe that weight quantization is less challenging than activation quantization. AWEQ transfers the difficulty of activation quantization to weights using channel equalization, achieving a balance between the...
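The channel-equalization idea in this abstract can be illustrated with a small numerical sketch: activations are scaled down and weights scaled up per input channel so the matrix product is unchanged, moving the quantization difficulty from activations onto the (easier to quantize) weights. The scale rule below is a SmoothQuant-style assumption for illustration, not necessarily AWEQ's exact formulation, and the helper names are hypothetical.

```python
# Minimal sketch of per-channel weight-activation equalization.
# Assumption: SmoothQuant-style scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha);
# the exact AWEQ rule may differ.
import numpy as np

def equalize(X, W, alpha=0.5):
    """Scale so that X @ W.T == (X / s) @ (W * s).T, shifting
    quantization difficulty from activations to weights."""
    act_max = np.abs(X).max(axis=0)                      # per-channel activation range
    w_max = np.abs(W).max(axis=0)                        # per-channel weight range
    s = (act_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)
    return X / s, W * s

def fake_quantize(t, bits=8):
    """Symmetric per-tensor fake quantization to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max() / qmax
    return np.round(t / scale) * scale

# Usage: with a few outlier activation channels, the equalized tensors
# quantize with lower end-to-end error than the raw ones.
X = np.random.randn(128, 64) * np.array([10.0] * 4 + [1.0] * 60)  # outlier channels
W = np.random.randn(256, 64)
Xe, We = equalize(X, W)
ref = X @ W.T
err_plain = np.abs(fake_quantize(X) @ fake_quantize(W).T - ref).mean()
err_eq = np.abs(fake_quantize(Xe) @ fake_quantize(We).T - ref).mean()
print(f"plain: {err_plain:.4f}  equalized: {err_eq:.4f}")
```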
One-bit quantization is a general tool to execute a complex model, such as deep neural networks, on a...
Post-training quantization attracts increasing attention due to its convenience in deploying quantiz...
Data-free quantization is a task that compresses the neural network to low bit-width without access ...
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quant...
With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in...
Post-training quantization (PTQ) is the go-to compression technique for large generative models, suc...
There is growing interest in adapting large-scale language models using parameter-efficient fine-t...
The recent advance of self-supervised learning associated with the Transformer architecture enables ...
Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for i...
Quantization has become a predominant approach for model compression, enabling deployment of large m...
We propose a novel 2-stage sub 8-bit quantization aware training algorithm for all components of a 2...
Large Language Models (LLMs) are machine learning models used to understand and genera...
Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race to...
The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the deman...
While neural networks have been remarkably successful in a wide array of applications, implementing ...