AI explanations of AIs

Build open AI tools that explain AIs, including AI agents. Examples: automatic feature descriptions for neuron activation patterns, an interface for steering these features, and a behaviour elicitation agent that "searches" for a specified behaviour in frontier models.

Theory of Change: Use AI to help improve interpretability and evals. Develop and release open tools to level up the whole field. Get invited to improve lab processes.

General Approach: Cognitive

Target Case: Pessimistic

Orthodox Problems:

7. Superintelligence can fool human supervisors

8. Superintelligence can hack software supervisors

Some names: Jacob Steinhardt, Neil Chowdhury, Vincent Huang, Sarah Schwettmann

Estimated FTEs: 15-30

Outputs:

Automatically Jailbreaking Frontier Language Models with Investigator Agents

Surfacing Pathological Behaviors in Language Models

Investigating truthfulness in a pre-release o3 model (Neil Chowdhury, Daniel Johnson, Vincent Huang, Jacob Steinhardt, Sarah Schwettmann)

Language Model Circuits Are Sparse in the Neuron Basis (Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann)

Introducing Docent (Kevin Meng, Vincent Huang, Jacob Steinhardt, Sarah Schwettmann)