Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized. We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations be...
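The core idea in the abstract above — quantizing both weights and activations so the matrix multiply itself runs in integer arithmetic — can be illustrated with a minimal sketch. This is not the paper's actual method (which targets lower bit-widths and per-group scales); it is a generic symmetric per-tensor quantization example, with all function names chosen here for illustration:

```python
import numpy as np

def quantize_sym(x, bits=8):
    # Symmetric per-tensor quantization: map floats to signed ints
    # using a single scale derived from the tensor's max magnitude.
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(x))
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quantized_matmul(w, a, bits=8):
    # Quantize weights AND activations, accumulate in int32
    # (the compute-bound part runs on integers), then rescale
    # the accumulator back to float with the product of scales.
    qw, sw = quantize_sym(w, bits)
    qa, sa = quantize_sym(a, bits)
    acc = qw.astype(np.int32) @ qa.astype(np.int32)
    return acc.astype(np.float32) * (sw * sa)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)
A = rng.standard_normal((8, 3)).astype(np.float32)
err = np.abs(quantized_matmul(W, A) - W @ A).max()
```

Because both operands are integers, the inner product can use fast integer tensor cores; weight-only schemes instead dequantize weights back to floats before the multiply, which saves memory bandwidth but not compute.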
As the training of giant dense models hits the boundary on the availability and capability of the ha...
Large language models (LLMs) based on transformers have made significant strides in recent years, th...
Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these model...
The recent advance of self-supervised learning associated with the Transformer architecture enables ...
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quant...
The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the deman...
Large language models (LLMs) exhibit excellent performance across a variety of tasks, but they come w...
With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in...
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown e...
Scaling language models with more data, compute and parameters has driven significant progress in na...
There is growing interest in adapting large-scale language models using parameter-efficient fine-t...
LLMs, or Large Language Models, are machine learning models used to understand and genera...
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fin...
Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion...
Existing large language models have to run K times to generate a sequence of K tokens. In this paper...