Other interpretability
Interpretability work that does not fit neatly into the other categories.
Theory of Change: Explore alternative conceptual frameworks (e.g., agentic, propositional) and physics-inspired methods (e.g., renormalization), or simply be "pragmatic".
Some names: Lee Sharkey, Dario Amodei, David Chalmers, Been Kim, Neel Nanda, David D. Baek, Lauren Greenspan, Dmitry Vaintrob, Sam Marks, Jacob Pfau
Estimated FTEs: 30-60
Outputs:
Transformers Don't Need LayerNorm at Inference Time: Implications for Interpretability — submarat, Joachim Schaeffer, Luca Baroni, galvsk, StefanHex
Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing — Zhe Li, Wei Zhao, Yige Li, Jun Sun
Open Problems in Mechanistic Interpretability — Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath
Against blanket arguments against interpretability — Dmitry Vaintrob
Opportunity Space: Renormalization for AI Safety — Lauren Greenspan, Dmitry Vaintrob, Lucas Teixeira
Prospects for Alignment Automation: Interpretability Case Study — Jacob Pfau, Geoffrey Irving
The Urgency of Interpretability — Dario Amodei
Language Models May Verbatim Complete Text They Were Not Explicitly Trained On — Ken Ziyu Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot
Propositional Interpretability in Artificial Intelligence — David J. Chalmers
Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey — Yunkai Dang, Kaichen Huang, Jiahao Huo, Yibo Yan, Sirui Huang, Dongrui Liu, Mengxi Gao, Jie Zhang, Chen Qian, Kun Wang, Yong Liu, Jing Shao, Hui Xiong, Xuming Hu
Renormalization Redux: QFT Techniques for AI Interpretability — Lauren Greenspan, Dmitry Vaintrob
The Strange Science of Interpretability: Recent Papers and a Reading List for the Philosophy of Interpretability — Kola Ayonrinde, Louis Jaburi
Through a Steerable Lens: Magnifying Neural Network Interpretability via Phase-Based Extrapolation — Farzaneh Mahdisoltani, Saeed Mahdisoltani, Roger B. Grosse, David J. Fleet
Call for Collaboration: Renormalization for AI Safety — Lauren Greenspan
On the creation of narrow AI: hierarchy and nonlocality of neural network skills — Eric J. Michaud, Asher Parker-Sartori, Max Tegmark
Harmonic Loss Trains Interpretable AI Models — David D. Baek, Ziming Liu, Riya Tyagi, Max Tegmark
Extracting memorized pieces of (copyrighted) books from open-weight language models — A. Feder Cooper, Aaron Gokaslan, Ahmed Ahmed, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang