Inference-time: Steering
Manipulate an LLM's internal representations or token probabilities at inference time, without touching the weights (see the sketch after the outputs list).
Theory of Change: "LLMs don't seem very dangerous and might scale to AGI; things are generally smooth; the relevant capabilities are harder than alignment. Assume no mesa-optimisers; assume zero-shot deception is hard; assume a fundamentally human-ish ontology is learned; assume no simulated agents; assume that noise in the data means human preferences are not ruled out; assume that alignment is a superficial feature; assume that tuning for what we want will also get us to avoid what we don't want. Maybe also assume that thoughts are translucent."
General Approach: Engineering
Target Case: Average Case
Outputs:
Steering Language Models with Weight Arithmetic — Constanza Fierro, Fabien Roger
EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences — Kshitish Ghate, Andy Liu, Devansh Jain, Taylor Sorensen, Atoosa Kasirzadeh, Aylin Caliskan, Mona T. Diab, Maarten Sap
Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation — Stuart Armstrong, Matija Franklin, Connor Stevens, Rebecca Gorman
In-Distribution Steering: Balancing Control and Coherence in Language Model Generation — Arthur Vogels, Benjamin Wong, Yann Choho, Annabelle Blangero, Milan Bhan
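To make the description above concrete, here is a minimal sketch of one common steering recipe, contrastive activation addition via a PyTorch forward hook. It is not the method of any listed paper: the model (GPT-2), injection layer, prompt pair, and strength `ALPHA` are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumed stand-in; any small causal LM works similarly
LAYER = 6            # assumed injection layer (illustrative)
ALPHA = 4.0          # assumed steering strength (illustrative)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def last_token_activation(text: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER for the final token."""
    ids = tok(text, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output
    # is hidden_states[LAYER + 1].
    return out.hidden_states[LAYER + 1][0, -1]

# Contrastive steering vector: activation difference on a paired prompt.
steer = last_token_activation("I love this") - last_token_activation("I hate this")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the vector at every position and pass the rest through unchanged.
    hidden = output[0] + ALPHA * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("The movie was", return_tensors="pt").input_ids
    gen = model.generate(ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # remove the hook; the weights were never modified
```

Because the intervention lives in a removable hook, the base weights and default behaviour are untouched, which is the defining property of inference-time steering.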