Activation Addition: Steering Language Models Without Optimization

08/20/2023
by Alex Turner, et al.

Reliably controlling the behavior of large language models (LLMs) is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback (RLHF), prompt engineering, and guided decoding. We instead investigate activation engineering: modifying activations at inference time to predictably alter model behavior. In particular, we bias the forward pass with an added 'steering vector' that is implicitly specified through natural language. Unlike past work, which learned these steering vectors (Subramani, Suresh, and Peters 2022; Hernandez, Li, and Andreas 2023), our Activation Addition (ActAdd) method computes them by taking the differences in activations that result from pairs of prompts. We demonstrate ActAdd using GPT-2 on OpenWebText and ConceptNet. Our inference-time approach yields control over high-level properties of the output while preserving off-target model performance. It requires far less compute and implementation effort than finetuning or RLHF, allows users to provide natural language specifications, and its overhead scales naturally with model size.
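
To make the idea of computing a steering vector from a prompt pair concrete, here is a minimal sketch on a toy model. It is not the authors' implementation: the "model" is a hypothetical stack of three hand-written residual blocks over a 4-dimensional vector, and the names `embed`, `block`, `run`, and `steering_vector`, the coefficient, the injection layer, and the example prompts are all assumptions made for illustration. The toy also keeps one activation vector per prompt, so it ignores token positions entirely.

```python
from typing import List, Optional

Vector = List[float]
DIM = 4        # toy hidden size (assumption, not from the paper)
N_LAYERS = 3   # toy depth (assumption, not from the paper)


def embed(prompt: str) -> Vector:
    """Toy stand-in for an embedding: fold character codes into DIM slots."""
    vec = [0.0] * DIM
    for i, ch in enumerate(prompt):
        vec[i % DIM] += ord(ch) / 100.0
    return vec


def block(x: Vector, index: int) -> Vector:
    """Toy residual block: the input plus a layer-dependent update."""
    update = [0.1 * (index + 1) * x[(j + 1) % DIM] for j in range(DIM)]
    return [a + b for a, b in zip(x, update)]


def run(prompt: str, stop_before: int = N_LAYERS,
        steering: Optional[Vector] = None, inject_at: int = 0) -> Vector:
    """Return the activation entering block `stop_before` (the output by default).

    If `steering` is given, it is added to the activation entering
    block `inject_at` before that block runs.
    """
    x = embed(prompt)
    for i in range(stop_before):
        if steering is not None and i == inject_at:
            x = [a + s for a, s in zip(x, steering)]
        x = block(x, i)
    return x


def steering_vector(pos: str, neg: str, layer: int, coeff: float) -> Vector:
    """Scaled difference of the activations the two prompts produce at `layer`."""
    h_pos = run(pos, stop_before=layer)
    h_neg = run(neg, stop_before=layer)
    return [coeff * (p - n) for p, n in zip(h_pos, h_neg)]


layer = 1  # hypothetical injection layer
vec = steering_vector("Love", "Hate", layer=layer, coeff=5.0)
baseline = run("I think you are")
steered = run("I think you are", steering=vec, inject_at=layer)
```

No parameters are trained at any point: the only extra work is two forward passes for the prompt pair and one vector addition during the steered pass. With a real transformer, the same pattern would presumably be applied to the residual stream through a forward hook, but that wiring is not shown here.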


