Shallow Review of Technical AI Safety, 2025

Activation engineering

Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.
Theory of Change: Test interpretability theories by intervening on activations, and find new insights from interpretable causal interventions on representations. Or: build more tools to stack on top of fine-tuning. Slightly encourage the model to be nice, adding one more layer of defence to our bundle of partial alignment methods.
Target Case: Average Case
See Also:
Some names: Runjin Chen, Andy Arditi, David Krueger, Jan Wehner, Narmeen Oozeer, Reza Bayat, Adam Karvonen, Jiuding Sun, Tim Tian Hua, Helena Casademunt, Jacob Dunefsky, Thomas Marshall
Estimated FTEs: 20-100
Outputs:
Activation Space Interventions Can Be Transferred Between Large Language Models. Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amirali Abdullah
HyperSteer: Activation Steering at Scale with Hypernetworks. Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, Atticus Geiger
Steering Evaluation-Aware Language Models to Act Like They Are Deployed. Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning. Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
Persona Vectors: Monitoring and Controlling Character Traits in Language Models. Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey
Steering Large Language Model Activations in Sparse Spaces. Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent
Improving Steering Vectors by Targeting Sparse Autoencoder Features. Sviatoslav Chalnev, Matthew Siu, Arthur Conmy
Understanding Reasoning in Thinking Language Models via Steering Vectors. Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control. Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye
Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks. Madeline Brumley, Joe Kwon, David Krueger, Dmitrii Krasheninnikov, Usman Anwar
Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models. Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz
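
To make the one-line summary of this agenda concrete, here is a minimal sketch of one common way to steer activations: take the difference of mean activations between two contrasting sets of examples, then add a scaled copy of that direction to a hidden state. This is a toy under stated assumptions, not the method of any specific paper listed above. The activations are made-up three-dimensional lists, and the names `steering_vector`, `steer`, and `alpha` are hypothetical. In practice the vectors would be read from, and written back into, a chosen layer of a real model.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Coordinate-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(positive: Sequence[Sequence[float]],
                    negative: Sequence[Sequence[float]]) -> Vector:
    """Difference of means: mean(positive) - mean(negative)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(activation: Sequence[float],
          direction: Sequence[float],
          alpha: float = 1.0) -> Vector:
    """Add alpha * direction to a single activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations (assumed data): the two sets differ only in coordinate 0.
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = steering_vector(positive_acts, negative_acts)

# A hidden state to intervene on; alpha sets the strength of the intervention.
hidden = [0.5, 1.0, -1.0]
steered = steer(hidden, direction, alpha=0.5)

print(direction)  # [2.0, 0.0, 0.0]
print(steered)    # [1.5, 1.0, -1.0]
```

The scale `alpha` is the main knob here: zero leaves the activation unchanged, and larger values push further along the extracted direction, typically at some cost to the model's general capabilities.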