Shallow Review of Technical AI Safety, 2025

AI explanations of AIs

Make open AI tools to explain AIs, including AI agents. Examples include automatic feature descriptions for neuron activation patterns, an interface for steering these features, and a behaviour elicitation agent that "searches" for a specified behaviour in frontier models.
Theory of Change: Use AI to help improve interp and evals. Develop and release open tools to level up the whole field. Get invited to improve lab processes.
General Approach: Cognitive
Target Case: Pessimistic
Some names: Jacob Steinhardt, Neil Chowdhury, Vincent Huang, Sarah Schwettmann
Estimated FTEs: 15-30
Outputs:
Investigating truthfulness in a pre-release o3 model (Neil Chowdhury, Daniel Johnson, Vincent Huang, Jacob Steinhardt, Sarah Schwettmann)
Language Model Circuits Are Sparse in the Neuron Basis (Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann)
Introducing Docent (Kevin Meng, Vincent Huang, Jacob Steinhardt, Sarah Schwettmann)