Other interpretability

Interpretability work that does not fit neatly into the other categories.

Theory of Change: Explore alternative conceptual frameworks (e.g., agentic, propositional) and physics-inspired methods (e.g., renormalization). Or be "pragmatic".

Orthodox Problems:

7. Superintelligence can fool human supervisors

4. Goals misgeneralize out of distribution

Some names: Lee Sharkey, Dario Amodei, David Chalmers, Been Kim, Neel Nanda, David D. Baek, Lauren Greenspan, Dmitry Vaintrob, Sam Marks, Jacob Pfau

Estimated FTEs: 30-60

Critiques:

The Misguided Quest for Mechanistic AI Interpretability, Interpretability Will Not Reliably Find Deceptive AI.

Outputs:

Transformers Don't Need LayerNorm at Inference Time: Implications for Interpretability— submarat, Joachim Schaeffer, Luca Baroni, galvsk, StefanHex

Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing— Zhe Li, Wei Zhao, Yige Li, Jun Sun

Open Problems in Mechanistic Interpretability— Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath

Against blanket arguments against interpretability— Dmitry Vaintrob

Opportunity Space: Renormalization for AI Safety— Lauren Greenspan, Dmitry Vaintrob, Lucas Teixeira

Prospects for Alignment Automation: Interpretability Case Study— Jacob Pfau, Geoffrey Irving

The Urgency of Interpretability— Dario Amodei

Language Models May Verbatim Complete Text They Were Not Explicitly Trained On— Ken Ziyu Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot

Downstream applications as validation of interpretability progress— Sam Marks

Principles for Picking Practical Interpretability Projects— Sam Marks

Propositional Interpretability in Artificial Intelligence— David J. Chalmers

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey— Yunkai Dang, Kaichen Huang, Jiahao Huo, Yibo Yan, Sirui Huang, Dongrui Liu, Mengxi Gao, Jie Zhang, Chen Qian, Kun Wang, Yong Liu, Jing Shao, Hui Xiong, Xuming Hu

Renormalization Redux: QFT Techniques for AI Interpretability— Lauren Greenspan, Dmitry Vaintrob

The Strange Science of Interpretability: Recent Papers and a Reading List for the Philosophy of Interpretability— Kola Ayonrinde, Louis Jaburi

Through a Steerable Lens: Magnifying Neural Network Interpretability via Phase-Based Extrapolation— Farzaneh Mahdisoltani, Saeed Mahdisoltani, Roger B. Grosse, David J. Fleet

Call for Collaboration: Renormalization for AI safety— Lauren Greenspan

On the creation of narrow AI: hierarchy and nonlocality of neural network skills— Eric J. Michaud, Asher Parker-Sartori, Max Tegmark

Harmonic Loss Trains Interpretable AI Models— David D. Baek, Ziming Liu, Riya Tyagi, Max Tegmark

Extracting memorized pieces of (copyrighted) books from open-weight language models— A. Feder Cooper, Aaron Gokaslan, Ahmed Ahmed, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang