Recent advances in self-supervised learning, combined with the Transformer architecture, have enabled natural language processing (NLP) models to achieve extremely low perplexity. Such powerful models demand ever-increasing model sizes and, thus, large amounts of computation and memory. In this paper, we propose an efficient inference framework for large-scale generative language models. As the key to reducing model size, we quantize weights with a non-uniform quantization method. Quantized matrix multiplications are then accelerated by our proposed kernel, called nuQmm, which allows a wide trade-off between compression ratio and accuracy. Our proposed nuQmm reduces the latency of not only each GPU but also the entire inference of large LM...
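For concreteness, below is a minimal CPU-only NumPy sketch of lookup-table-based non-uniform weight quantization and the corresponding dequantize-then-multiply step. The function names, the k-means-style codebook fitting, and the 3-bit setting are illustrative assumptions, not the paper's actual algorithm; nuQmm itself is a dedicated GPU kernel for the quantized matrix multiplications described in the abstract.

```python
import numpy as np

def quantize_nonuniform(w, n_bits=3, n_iters=20):
    # Fit a per-matrix codebook of 2**n_bits non-uniform levels with 1-D k-means,
    # then store each weight as the index of its nearest level.
    flat = w.reshape(-1)
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, 2 ** n_bits))
    for _ in range(n_iters):
        codes = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(centroids.size):
            members = flat[codes == k]
            if members.size:
                centroids[k] = members.mean()
    codes = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return codes.reshape(w.shape).astype(np.uint8), centroids.astype(np.float32)

def lut_matmul(x, codes, codebook):
    # Dequantize by table lookup, then multiply; a fused GPU kernel would avoid
    # materializing the full-precision weight matrix like this sketch does.
    w_hat = codebook[codes]
    return x @ w_hat.T

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128)).astype(np.float32)  # weights of one linear layer
x = rng.standard_normal((4, 128)).astype(np.float32)    # a small batch of activations
codes, codebook = quantize_nonuniform(w, n_bits=3)       # 3-bit codes + 8-entry codebook
y = lut_matmul(x, codes, codebook)                       # (4, 256) output
err = np.abs(w - codebook[codes]).mean()                 # average quantization error
```

The non-uniform levels let the codebook follow the bell-shaped distribution of the weights, which is generally why a small number of bits can retain accuracy better than uniform grids at the same bit width.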
Graph Neural Network (GNN) training and inference involve significant challenges of scalability with...
Large language models (LLMs) based on transformers have made significant strides in recent years, th...
Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translati...
Scaling language models with more data, compute and parameters has driven significant progress in na...
The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the deman...
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quant...
Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race to...
Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion...
LLMs, or Large Language Models, are machine learning models used to understand and genera...
As the training of giant dense models reaches the limits of the availability and capability of the ha...
With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in...
Large language models (LLMs) exhibit excellent performance across a variety of tasks, but they come w...
There is growing interest in adapting large-scale language models using parameter-efficient fine-t...
Existing large language models have to run K times to generate a sequence of K tokens. In this paper...
Limited computational budgets often prevent transformers from being used in production and from havi...