AI explanations of AIs
Build open AI tools that explain AIs, including AI agents. Examples include automatic feature descriptions for neuron activation patterns, an interface for steering these features, and a behaviour elicitation agent that "searches" for a specified behaviour in frontier models.
Theory of Change: Use AI to help improve interp and evals. Develop and release open tools to level up the whole field. Get invited to improve lab processes.
General Approach: Cognitive
Target Case: Pessimistic
Some names: Jacob Steinhardt, Neil Chowdhury, Vincent Huang, Sarah Schwettmann
Estimated FTEs: 15-30
Outputs:
Investigating truthfulness in a pre-release o3 model - Neil Chowdhury, Daniel Johnson, Vincent Huang, Jacob Steinhardt, Sarah Schwettmann
Language Model Circuits Are Sparse in the Neuron Basis - Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann
Introducing Docent - Kevin Meng, Vincent Huang, Jacob Steinhardt, Sarah Schwettmann