Reverse engineering
Decompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model's internal algorithm.
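Below is a minimal sketch of the causal-validation step this describes, using activation patching: record activations from a "clean" run, splice them into a "corrupted" run at one component, and check whether the behaviour shifts back. The choice of GPT-2 small, the layer, the token position, and the prompts are illustrative assumptions, not taken from any of the papers listed under Outputs.

```python
# Activation-patching sketch (illustrative assumptions: GPT-2 small, layer 6,
# patching the residual stream at the country-token position).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

clean_prompt = "The capital of France is"    # clean run: model should favour " Paris"
corrupt_prompt = "The capital of Italy is"   # corrupted run: same template, different fact
layer_to_patch = 6                           # hypothetical layer choice
patch_pos = 3                                # position of the country token in both prompts

def run(prompt, patch=None):
    """Run the model, optionally splicing another run's activations into one block."""
    cache = {}

    def hook(module, inputs, output):
        # GPT2Block returns a tuple; output[0] is the residual-stream hidden states.
        cache["resid"] = output[0].detach().clone()
        if patch is not None:
            hs = output[0].clone()
            hs[:, patch_pos, :] = patch[:, patch_pos, :]  # overwrite one position
            return (hs,) + output[1:]                     # replaces the block's output

    handle = model.transformer.h[layer_to_patch].register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**ids).logits[0, -1]  # next-token logits at the final position
    finally:
        handle.remove()
    return logits, cache["resid"]

paris = tokenizer.encode(" Paris")[0]
rome = tokenizer.encode(" Rome")[0]

# 1) Record activations on the clean and corrupted prompts.
clean_logits, clean_resid = run(clean_prompt)
corrupt_logits, _ = run(corrupt_prompt)

# 2) Re-run the corrupted prompt with the clean run's activations patched in.
patched_logits, _ = run(corrupt_prompt, patch=clean_resid)

# 3) If the patched run shifts back towards " Paris", the patched location carries
#    causally relevant information for this behaviour.
def logit_diff(logits):
    return (logits[paris] - logits[rome]).item()

print(f"clean   logit diff (Paris - Rome): {logit_diff(clean_logits):+.2f}")
print(f"corrupt logit diff (Paris - Rome): {logit_diff(corrupt_logits):+.2f}")
print(f"patched logit diff (Paris - Rome): {logit_diff(patched_logits):+.2f}")
```

In practice, circuit analyses repeat interventions of this kind over many components (attention heads, MLPs, sparse features) and many prompt pairs; the resulting causal attributions are what a proposed circuit description is checked against.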
Theory of Change: By gaining a mechanistic understanding of how a model works (the "circuit diagram"), we can predict how models will act in novel situations (generalization), gain the knowledge needed to safely modify an AI's goals or internal mechanisms, enable high-confidence alignment auditing, and give better feedback to safety researchers.
General Approach: Cognitive
Target Case: Worst Case
See Also:
Some names: Lucius Bushnaq, Dan Braun, Lee Sharkey, Aaron Mueller, Atticus Geiger, Sheridan Feucht, David Bau, Yonatan Belinkov, Stefan Heimersheim, Chris Olah, Leo Gao
Estimated FTEs: 100-200
Critiques:
Interpretability Will Not Reliably Find Deceptive AI
A Problem to Solve Before Building a Deception Detector
MoSSAIC: AI Safety After Mechanism
The Misguided Quest for Mechanistic AI Interpretability
Mechanistic?
Assessing skeptical views of interpretability research
Activation space interpretability may be doomed
A Pragmatic Vision for Interpretability
Outputs:
The Circuits Research Landscape: Results and Perspectives — Jack Lindsey, Emmanuel Ameisen, Neel Nanda, Stepan Shabalin, Mateusz Piotrowski, Tom McGrath, Michael Hanna, Owen Lewis, Curt Tigges, Jack Merullo, Connor Watts, Gonçalo Paulo, Joshua Batson, Liv Gorton, Elana Simon, Max Loeffler, Callum McDougall, Johnny Lin
Circuits in Superposition: Compressing many small neural networks into one — Lucius Bushnaq, jake_mendel
MIB: A Mechanistic Interpretability Benchmark — Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov
RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching — Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda
The Dual-Route Model of Induction — Sheridan Feucht, Eric Todd, Byron Wallace, David Bau
Structural Inference: Interpreting Small Language Models with Susceptibilities — Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet
Stochastic Parameter Decomposition — Dan Braun, Lucius Bushnaq, Lee Sharkey
The Geometry of Self-Verification in a Task-Specific Reasoning Model — Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg
Converting MLPs into Polynomials in Closed Form — Nora Belrose, Alice Rigg
Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts — Jiahai Feng, Stuart Russell, Jacob Steinhardt
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition — Dan Braun, Lucius Bushnaq, Stefan Heimersheim, Jake Mendel, Lee Sharkey
Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition — Brianna Chrisman, Lucius Bushnaq, Lee Sharkey
From Memorization to Reasoning in the Spectrum of Loss Curvature — Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers — Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
How Do LLMs Perform Two-Hop Reasoning in Context? — Tianyu Guo, Hanlin Zhu, Ruiqi Zhang, Jiantao Jiao, Song Mei, Michael I. Jordan, Stuart Russell
Blink of an eye: a simple theory for feature localization in generative models — Marvin Li, Aayush Karan, Sitan Chen
On the creation of narrow AI: hierarchy and nonlocality of neural network skills — Eric J. Michaud, Asher Parker-Sartori, Max Tegmark
Interpreting Emergent Planning in Model-Free Reinforcement Learning — Thomas Bush, Stephen Chung, Usman Anwar, Adrià Garriga-Alonso, David Krueger
Building and evaluating alignment auditing agents — Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein, evhub
How Do Transformers Learn Variable Binding in Symbolic Programs? — Yiwei Wu, Atticus Geiger, Raphaël Millière
Fresh in memory: Training-order recency is linearly encoded in language model activations — Dmitrii Krasheninnikov, Richard E. Turner, David Krueger
Language Models use Lookbacks to Track Beliefs — Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger
Constrained belief updates explain geometric structures in transformer representations — Mateusz Piotrowski, Paul M. Riechers, Daniel Filan, Adam S. Shai
LLMs Process Lists With General Filter Heads — Arnab Sen Sharma, Giordano Rogers, Natalie Shapira, David Bau
Language Models Use Trigonometry to Do Addition — Subhash Kantamneni, Max Tegmark
Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban — Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso
Transformers Struggle to Learn to Search — Abulhair Saparov, Srushti Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, He He
Adversarial Examples Are Not Bugs, They Are Superposition — Liv Gorton, Owen Lewis
Do Language Models Use Their Depth Efficiently? — Róbert Csordás, Christopher D. Manning, Christopher Potts
ICLR: In-Context Learning of Representations — Core Francisco Park, Andrew Lee, Ekdeep Singh Lubana, Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, Hidenori Tanaka