Existing large language models have to run K times to generate a sequence of K tokens. In this paper, we present RecycleGPT, a generative language model with fast decoding speed by recycling pre-generated model states without running the whole model in multiple steps. Our approach relies on the observation that adjacent tokens in a sequence usually have strong correlations and the next token in a sequence can be reasonably guessed or inferred based on the preceding ones. Experiments and analysis demonstrate the effectiveness of our approach in lowering inference latency, achieving up to 1.4x speedup while preserving high performance. Comment: Technical Report
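To make the recycling idea above concrete, here is a minimal conceptual sketch of interleaving full-model steps with cheap "recycled" guesses made from the previous step's hidden state. The `RecycleHead` module, the `full_model(ids) -> (logits, hidden_states)` interface, and the strict one-to-one alternation are illustrative assumptions for this sketch, not the paper's actual architecture or decoding schedule.

```python
import torch
import torch.nn as nn


class RecycleHead(nn.Module):
    """Hypothetical lightweight head that guesses the next token
    directly from the last hidden state of the full model."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_state)


@torch.no_grad()
def recycle_decode(full_model, recycle_head, input_ids, max_new_tokens):
    """Alternate full-model forward passes with recycled guesses.

    Assumes `full_model(ids)` returns (logits, hidden_states), both of
    shape (batch, seq_len, ...). Every second token is produced by the
    recycle head from the cached hidden state, skipping a full pass.
    """
    ids = input_ids
    for _ in range(max_new_tokens // 2):
        # Full forward pass: produces the next token and fresh hidden states.
        logits, hidden = full_model(ids)
        tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, tok], dim=-1)

        # Recycled step: reuse the last hidden state to guess one more token
        # without running the whole model again.
        guess = recycle_head(hidden[:, -1]).argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, guess], dim=-1)
    return ids
```

Under these assumptions, half of the generated tokens avoid a full forward pass, which is one way a speedup on the order of the reported 1.4x could arise; the quality of the guessed tokens then depends on how strongly adjacent tokens correlate.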
We contribute a faster decoding algorithm for phrase-based machine translation. Translation hypoth...
Parameter-shared pre-trained language models (PLMs) have emerged as a successful approach in resourc...
Autoregressive Transformers are strong language models but incur O(T) complexity during per-token ge...
The recent emergence of Large Language Models based on the Transformer architecture has enabled dram...
Recent advances in Transformer-based large language models (LLMs) have led to significant performanc...
In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language ...
The recent advance of self-supervised learning associated with the Transformer architecture enables ...
In this work, we propose FastCoT, a model-agnostic framework based on parallel decoding without any ...
This work aims at decreasing the end-to-end generation latency of large language models (LLMs). One ...
Scaling language models with more data, compute and parameters has driven significant progress in na...
Despite the huge progress in myriad generation tasks, pretrained language models (LMs) such as GPT2 ...
While large-scale neural language models, such as GPT2 and BART, have achieved impressive results on...
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashi...
Recent advances in Transformer-based Large Language Models have made great strides in natural langua...
Autoregressive models, despite their commendable performance in a myriad of generative tasks, face c...