While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with general, aggregate human preferences, it is suboptimal for learning diverse, individual perspectives. In this work, we study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem. Compared to strong single-objective baselines, we show that we can achieve personalized alignment by decomposing preferences into multiple dimensions. These dimensions are defined based on personalizations that are declared as desirable by the user. In this work, we show that they can be efficiently...
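As an illustration of the MORL framing in the abstract above, the sketch below shows one plausible way to scalarize per-dimension preference rewards into a single personalized reward using user-declared weights. The dimension names, weights, and toy reward models are hypothetical placeholders for illustration only, not the method described in the paper.

```python
# Illustrative MORL-style scalarization sketch (assumptions, not the paper's method):
# each preference dimension has its own reward model, and a user profile supplies
# weights over those dimensions; the personalized reward is the weighted sum.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PreferenceProfile:
    # User-declared weights over preference dimensions (e.g. conciseness, friendliness).
    weights: Dict[str, float]


def combined_reward(
    reward_models: Dict[str, Callable[[str, str], float]],
    profile: PreferenceProfile,
    prompt: str,
    response: str,
) -> float:
    """Scalarize per-dimension rewards into one personalized reward signal."""
    return sum(
        profile.weights.get(dim, 0.0) * rm(prompt, response)
        for dim, rm in reward_models.items()
    )


# Hypothetical usage with toy stand-ins for dimension-specific reward models.
reward_models = {
    "conciseness": lambda p, r: -len(r.split()) / 100.0,
    "friendliness": lambda p, r: 1.0 if "thanks" in r.lower() else 0.0,
}
profile = PreferenceProfile(weights={"conciseness": 0.7, "friendliness": 0.3})
score = combined_reward(reward_models, profile, "Explain RLHF.", "Thanks! RLHF is ...")
print(round(score, 3))
```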
We define a novel neuro-symbolic framework, argumentative reward learning, which combines preference...
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align wi...
This thesis explores approaches to modelling individual differences in language use. The difference...
Reinforcement learning from human feedback (RLHF) is effective at aligning large language models (LL...
Large Language Models (LLMs) have been a significant landmark of Artificial Intelligence (AI) advanc...
Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled...
With the development of large language models (LLMs), striking a balance between the performance and...
The utility of reinforcement learning is limited by the alignment of reward functions with the inter...
Generative foundation models are susceptible to implicit biases that can arise from extensive unsupe...
Deploying learning systems in the real world requires aligning their objectives with those of the hu...
Many applications of large language models (LLMs), ranging from chatbots to creative writing, requir...
Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalitie...
Reinforcement Learning from Human Feedback (RLHF) is a vital strategy for enhancing model safety in ...
We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If...
To use reinforcement learning from human feedback (RLHF) in practical applications, it is crucial to...