Shallow Review of Technical AI Safety, 2025

Other interpretability

Interpretability work that does not fit neatly into the other categories.
Theory of Change: Explore alternative conceptual frameworks (e.g., agentic, propositional) and physics-inspired methods (e.g., renormalization). Or be "pragmatic".
Some names: Lee Sharkey, Dario Amodei, David Chalmers, Been Kim, Neel Nanda, David D. Baek, Lauren Greenspan, Dmitry Vaintrob, Sam Marks, Jacob Pfau
Estimated FTEs: 30-60
Outputs:
Transformers Don't Need LayerNorm at Inference Time: Implications for Interpretability, by submarat, Joachim Schaeffer, Luca Baroni, galvsk, StefanHex
Open Problems in Mechanistic Interpretability, by Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath
Opportunity Space: Renormalization for AI Safety, by Lauren Greenspan, Dmitry Vaintrob, Lucas Teixeira
Language Models May Verbatim Complete Text They Were Not Explicitly Trained On, by Ken Ziyu Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot
Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey, by Yunkai Dang, Kaichen Huang, Jiahao Huo, Yibo Yan, Sirui Huang, Dongrui Liu, Mengxi Gao, Jie Zhang, Chen Qian, Kun Wang, Yong Liu, Jing Shao, Hui Xiong, Xuming Hu
Through a Steerable Lens: Magnifying Neural Network Interpretability via Phase-Based Extrapolation, by Farzaneh Mahdisoltani, Saeed Mahdisoltani, Roger B. Grosse, David J. Fleet
On the creation of narrow AI: hierarchy and nonlocality of neural network skills, by Eric J. Michaud, Asher Parker-Sartori, Max Tegmark
Harmonic Loss Trains Interpretable AI Models, by David D. Baek, Ziming Liu, Riya Tyagi, Max Tegmark
Extracting memorized pieces of (copyrighted) books from open-weight language models, by A. Feder Cooper, Aaron Gokaslan, Ahmed Ahmed, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang