Activation engineering
Programmatically modify a model's internal activations to steer its outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.
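The basic recipe behind many of the steering-vector papers below is difference-of-means activation addition: extract activations on contrastive prompt sets, take the mean difference as a steering vector, and add a scaled copy of it back into the residual stream at inference. A minimal toy sketch (everything here is illustrative: `hidden_state` stands in for a hooked transformer layer, and the random embeddings stand in for real contrastive prompts):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W = rng.standard_normal((d_model, d_model))

def hidden_state(prompt_embedding):
    # Toy stand-in for one layer's residual-stream activation; in practice
    # this would be captured with a forward hook on a real model.
    return np.tanh(W @ prompt_embedding)

# Contrastive prompt sets, represented here by random toy embeddings
# (e.g. "be nice" completions vs. neutral completions).
pos_embeddings = rng.standard_normal((16, d_model))
neg_embeddings = rng.standard_normal((16, d_model))

# Steering vector = mean activation difference between the two sets.
pos_acts = np.stack([hidden_state(e) for e in pos_embeddings])
neg_acts = np.stack([hidden_state(e) for e in neg_embeddings])
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steered_hidden(prompt_embedding, alpha=4.0):
    # Activation addition: nudge this layer's output along the
    # "positive" direction; alpha controls steering strength.
    return hidden_state(prompt_embedding) + alpha * steering_vec
```

In a real model the same intervention is typically applied at one chosen layer via a forward hook, and the steering coefficient `alpha` is tuned to trade off behavior change against capability degradation.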
Theory of Change: Test interpretability theories by intervening on activations, and find new insights from interpretable causal interventions on representations. Or: build more stuff to stack on top of fine-tuning. Slightly encourage the model to be nice; add one more layer of defence to our bundle of partial alignment methods.
Target Case: Average Case
Orthodox Problems:
See Also:
Some names: Runjin Chen, Andy Arditi, David Krueger, Jan Wehner, Narmeen Oozeer, Reza Bayat, Adam Karvonen, Jiuding Sun, Tim Tian Hua, Helena Casademunt, Jacob Dunefsky, Thomas Marshall
Estimated FTEs: 20-100
Outputs:
Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers — Ruben Belo, Marta Guimaraes, Claudia Soares
Activation Space Interventions Can Be Transferred Between Large Language Models — Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amirali Abdullah
HyperSteer: Activation Steering at Scale with Hypernetworks — Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, Atticus Geiger
Steering Evaluation-Aware Language Models to Act Like They Are Deployed — Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning — Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
Persona Vectors: Monitoring and Controlling Character Traits in Language Models — Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey
Steering Large Language Model Activations in Sparse Spaces — Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent
Improving Steering Vectors by Targeting Sparse Autoencoder Features — Sviatoslav Chalnev, Matthew Siu, Arthur Conmy
Understanding Reasoning in Thinking Language Models via Steering Vectors — Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
One-shot steering vectors cause emergent misalignment, too — Jacob Dunefsky
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control — Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye
Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks — Madeline Brumley, Joe Kwon, David Krueger, Dmitrii Krasheninnikov, Usman Anwar
Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models — Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz
Robustly Improving LLM Fairness in Realistic Settings via Interpretability — Adam Karvonen, Samuel Marks