Large language models (LLMs), designed to provide helpful and safe responses, often rely on alignment techniques to align with user intent and social guidelines. Unfortunately, this alignment can be exploited by malicious actors seeking to manipulate an LLM's outputs for unintended purposes. In this paper, we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that -- when combined with a user's query -- disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. Our approach systematically reveals a model's limitations and vulnerabilities by unco...
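To make the attack described above concrete, the following is a minimal sketch of a black-box genetic algorithm that evolves a universal adversarial suffix. It is not the paper's implementation: the vocabulary, hyperparameters, and the `score_response` fitness stub are illustrative assumptions; in the actual setting the fitness would query the attacked model and score how far its response drifts from aligned behavior.

```python
import random

# Illustrative token pool, suffix length, and GA hyperparameters (assumptions).
VOCAB = ["describing", "-->", "!", "similarly", "Sure", "step", "tutorial", "###"]
SUFFIX_LEN = 8
POP_SIZE = 20
GENERATIONS = 50


def score_response(prompt: str) -> float:
    """Hypothetical fitness: would query the target model (black-box) and
    return a higher value the more the output violates alignment."""
    return random.random()  # stub so the sketch runs end to end


def fitness(suffix: list[str], queries: list[str]) -> float:
    # A *universal* suffix is scored by averaging over a batch of user queries.
    joined = " ".join(suffix)
    return sum(score_response(q + " " + joined) for q in queries) / len(queries)


def crossover(a: list[str], b: list[str]) -> list[str]:
    cut = random.randrange(1, SUFFIX_LEN)  # single-point crossover
    return a[:cut] + b[cut:]


def mutate(suffix: list[str], rate: float = 0.1) -> list[str]:
    # Replace each token with a random vocabulary token with small probability.
    return [random.choice(VOCAB) if random.random() < rate else tok for tok in suffix]


def evolve(queries: list[str]) -> list[str]:
    population = [[random.choice(VOCAB) for _ in range(SUFFIX_LEN)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=lambda s: fitness(s, queries), reverse=True)
        elite = ranked[: POP_SIZE // 4]  # elitist selection: keep the best quarter
        children = [mutate(crossover(*random.sample(elite, 2)))
                    for _ in range(POP_SIZE - len(elite))]
        population = elite + children
    return max(population, key=lambda s: fitness(s, queries))


if __name__ == "__main__":
    print(" ".join(evolve(["Example user query"])))
```

The key design point the sketch captures is that only model outputs are needed (no gradients or parameters), and that the suffix is optimized against a batch of queries so the same string transfers across prompts.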
Although remarkable progress has been achieved in preventing large language model (LLM) hallucinatio...
Transformer-based large language models with emergent capabilities are becoming increasingly ubiquit...
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to genera...
Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe res...
Recently, Large Language Models (LLMs) have made significant advancements and are now widely used ac...
The misuse of large language models (LLMs) has garnered significant attention from the general publi...
Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs...
Large language models (LLMs), such as ChatGPT, have emerged with astonishing capabilities approachin...
Spurred by the recent rapid increase in the development and distribution of large language models (L...
Large Language Models (LLMs) are central to a multitude of applications but struggle with significan...
Large Language Models (LLMs) continue to advance in their capabilities, yet this progress is accompa...
Engaging in the deliberate generation of abnormal outputs from large language models (LLMs) by attac...
Jailbreak vulnerabilities in Large Language Models (LLMs), which exploit meticulously crafted prompt...
Large language models (LLMs) have taken the world by storm with their massive multi-tasking capabil...
Fine-tuning is a common and effective method for tailoring large language models (LLMs) to specializ...