We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
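As a brief illustration, the 2:4 semi-structured pattern mentioned above means that at most two of every four consecutive weights are non-zero, which is what makes the sparsity hardware-friendly. The short sketch below (hypothetical helper prune_2_4, NumPy, plain magnitude criterion) only shows what such a pattern looks like; it is not the SparseGPT selection rule, which instead decides which weights to drop using approximate second-order information and updates the remaining weights to compensate.

    import numpy as np

    def prune_2_4(weights: np.ndarray) -> np.ndarray:
        # Zero the two smallest-magnitude entries in every group of four
        # consecutive weights (weights.size must be divisible by 4).
        # Illustrative magnitude-based sketch only, not the SparseGPT criterion.
        groups = weights.reshape(-1, 4).copy()
        drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # two smallest per group
        np.put_along_axis(groups, drop, 0.0, axis=1)
        return groups.reshape(weights.shape)

    w = np.array([[0.9, -0.1, 0.05, -0.7, 0.3, 0.2, -0.8, 0.01]])
    print(prune_2_4(w))   # [[ 0.9  0.  0.  -0.7  0.3  0.  -0.8  0. ]]

Each group of four keeps exactly two weights, so the matrix can be stored in the compressed 2:4 format that sparse tensor hardware accelerates.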
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fin...
The lottery ticket hypothesis (LTH) has shown that dense models contain highly sparse subnetworks (i...
Sparsity is commonly produced from model compression (i.e., pruning), which eliminates unnecessary p...
Sparsifying the Transformer has garnered considerable interest, as training the Transformer is very ...
The growing energy and performance costs of deep learning have driven the community to reduce the si...
Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior ...
The training of sparse neural networks is becoming an increasingly important tool for reducing the ...
Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DN...
With the dramatically increased number of parameters in language models, sparsity methods have recei...
Neural machine translation (NMT) strongly outperforms previous statistical techniques. With the eme...
Deep learning has been empirically successful in recent years thanks to the extremely over-parameter...
Structural neural network pruning aims to remove the redundant channels in the deep convolutional ne...
Large Language Models have become the core architecture upon which most modern natural language proc...
Efficient Transformers have been developed for long sequence modeling, due to their subquadratic mem...
Works on lottery ticket hypothesis (LTH) and single-shot network pruning (SNIP) have raised a lot of...