Recently, Large Language Models (LLMs) have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content. Though a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be directly constructed upon an existing aligned LLM with a robust alignment checking function, without...
The misuse of large language models (LLMs) has garnered significant attention from the general publi...
With the boom of Large Language Models (LLMs), the research of solving Math Word Problem (MWP) has r...
The monumental achievements of deep learning (DL) systems seem to guarantee the absolute superiority...
The growing awareness of safety concerns in large language models (LLMs) has sparked considerable in...
Large language models (LLMs), designed to provide helpful and safe responses, often rely on alignmen...
The prevalence and strong capability of large language models (LLMs) present significant safety and ...
Spurred by the recent rapid increase in the development and distribution of large language models (L...
Recent years have witnessed remarkable progress made in large language models (LLMs). Such advanceme...
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to genera...
Large Language Models (LLMs) are central to a multitude of applications but struggle with significan...
Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs...
Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe res...
Large language models (LLMs) have taken the world by storm with their massive multi-tasking capabili...
Jailbreak vulnerabilities in Large Language Models (LLMs), which exploit meticulously crafted prompt...
Large Language Models (LLMs) continue to advance in their capabilities, yet this progress is accompa...