Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering, and guided decoding. We instead investigate activation engineering: modifying activations at inference time to predictably alter model behavior. In particular, we bias the forward pass with an added 'steering vector' implicitly specified through natural language. Unlike past work, which learned these steering vectors, our Activation Addition (ActAdd) method computes them by taking the activation differences that result from pairs of prompts. We demonstrate ActAdd on GPT-2, evaluating on OpenWebText and ConceptNet. Our inference-time approach yields control...
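
A minimal sketch of the steering-vector idea this abstract describes, assuming GPT-2 via Hugging Face transformers and a PyTorch forward hook. The layer index, coefficient, and the " Love"/" Hate" prompt pair are illustrative assumptions, not values taken from the paper.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6    # transformer block to steer -- an assumed, illustrative choice
COEFF = 4.0  # scaling coefficient for the steering vector -- likewise assumed

def residual_stream(prompt):
    """Residual-stream activations entering block LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER]  # shape (1, seq_len, d_model)

# Steering vector: activation difference from a contrasting prompt pair,
# truncated to the shorter prompt in case token counts differ.
h_plus, h_minus = residual_stream(" Love"), residual_stream(" Hate")
n_pair = min(h_plus.shape[1], h_minus.shape[1])
steer = COEFF * (h_plus[:, :n_pair] - h_minus[:, :n_pair])

def add_steering(module, inputs, output):
    # Bias the forward pass: add the vector to the leading positions only.
    hidden = output[0]
    n = min(steer.shape[1], hidden.shape[1])
    hidden[:, :n, :] += steer[:, :n, :].to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("I went up to my friend and said", return_tensors="pt").input_ids
# use_cache=False re-runs the full sequence each step, so the addition
# stays confined to the leading positions rather than every new token.
out = model.generate(ids, max_new_tokens=30, do_sample=False, use_cache=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))

Running the same prompt with and without the hook registered shows the qualitative shift in completions that the abstract alludes to.
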
Recent advancements in Large Language Models (LLMs) have enabled the development of a single model c...
Natural Language Inference (NLI) models are known to learn from biases and artefacts within their tr...
Recent advances in Transformer-based large language models (LLMs) have led to significant performanc...
Prior work on controllable text generation has focused on learning how to control language models th...
Prefix-tuning is a powerful lightweight technique for adapting a large pre-trained language model to...
In recent years, language models (LMs) have made remarkable progress in advancing the field of natu...
This paper explores the effectiveness of prompt programming in the fine-tuning process of a Hungaria...
Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models...
Language model fine-tuning is essential for modern natural language processing, but is computational...
Pretrained large language models (LLMs) are strong in-context learners that are able to perform few-...
In this work, we evaluate 10 open-source instructed LLMs on four representative code comprehension a...
Language models, given their black-box nature, often exhibit sensitivity to input perturbations, lea...
Fine-tuning large language models for different tasks can be costly and inefficient, and even method...