As large language models (LLMs) have increased in capability, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks. In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We furth...
The monumental achievements of deep learning (DL) systems seem to guarantee the absolute superiority...
Pre-trained Large Language Models (LLMs) are an integral part of modern AI that have led to breakthr...
Large language models are shown to present privacy risks through memorization of training data, and ...
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align wi...
Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models...
With the development of large language models (LLMs), striking a balance between the performance and...
Spurred by the recent rapid increase in the development and distribution of large language models (L...
Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs...
Recently, Large Language Models (LLMs) have made significant advancements and are now widely used ac...
Pretrained large language models (LLMs) are strong in-context learners that are able to perform few-...
The prevalence and strong capability of large language models (LLMs) present significant safety and ...
Reinforcement learning (RL) is frequently employed in fine-tuning large language models (LMs), such ...
Adopting a two-stage paradigm of pretraining followed by fine-tuning, Pretrained Language Models (PL...
Fine-tuning is a common and effective method for tailoring large language models (LLMs) to specializ...
Widely used language models (LMs) are typically built by scaling up a two-stage training pipeline: a...