Shallow Review of Technical AI Safety, 2025

Sparse Coding

Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic "features" that correspond to interpretable concepts (a minimal code sketch of this decomposition follows the field list below).
Theory of Change: Get a principled decomposition of an LLM's activations into atomic components → identify deception and other misbehaviors.
Target Case: Average Case
Some names: Leo Gao, Dan Mossing, Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Thomas Heap, Abhinav Menon, Kenny Peng, Tim Lawson
Estimated FTEs: 50-100
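As a rough illustration of the decomposition described above, here is a minimal sketch of an L1-penalized sparse autoencoder trained on residual-stream activations. The class and function names, dimensions, and hyperparameters are illustrative assumptions rather than any listed paper's setup; several of the outputs below instead use TopK or BatchTopK sparsity constraints in place of the L1 penalty.

```python
# Minimal sparse autoencoder (SAE) sketch over residual-stream activations.
# Names, sizes, and the l1 coefficient are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Encoder maps an activation vector to a wide, non-negative feature vector.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder reconstructs the activation as a linear combination of feature directions.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Usage sketch: train on activations collected from one layer of the model.
sae = SparseAutoencoder(d_model=768, d_features=16384)
acts = torch.randn(32, 768)               # stand-in batch of residual-stream activations
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
```

The interpretability claim is then that individual decoder columns act as feature directions whose activations can be inspected, probed, steered, or ablated, which is the workhorse move behind most of the outputs listed below.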
Outputs:
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning (Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda)
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models (Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda)
Circuit Tracing: Revealing Computational Graphs in Language Models (Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson)
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models (Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata)
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders (Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets)
Sparse Autoencoders Do Not Find Canonical Units of Analysis (Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda)
Transcoders Beat Sparse Autoencoders for Interpretability (Gonçalo Paulo, Stepan Shabalin, Nora Belrose)
CRISP: Persistent Concept Unlearning via Sparse Autoencoders (Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov)
The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs (Omar Mahmoud, Ali Khalil, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana)
Scaling sparse feature circuit finding for in-context learning (Dmitrii Kharlapenko, Stepan Shabalin, Fazl Barez, Arthur Conmy, Neel Nanda)
Learning Multi-Level Features with Matryoshka Sparse Autoencoders (Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda)
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing (Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda)
Priors in Time: Missing Inductive Biases for Language Model Interpretability (Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J. Bigelow, Johnny Lin, Demba Ba, Martin Wattenberg, Fernanda Viegas, Melanie Weber, Aaron Mueller)
Binary Sparse Coding for Interpretability (Lucia Quirke, Stepan Shabalin, Nora Belrose)
Scaling Sparse Feature Circuit Finding to Gemma 9B (Diego Caples, Jatin Nainani, Callum McDougall, rrenaud)
Dense SAE Latents Are Features, Not Bugs (Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark)
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks (Adam Karvonen, Can Rager, Samuel Marks, Neel Nanda)
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability (Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda)
SAEs Are Good for Steering -- If You Select the Right Features (Dana Arad, Aaron Mueller, Yonatan Belinkov)
Line of Sight: On Linear Representations in VLLMs (Achyuta Rajaram, Sarah Schwettmann, Jacob Andreas, Arthur Conmy)
Low-Rank Adapting Models for Sparse Autoencoders (Matthew Chen, Joshua Engels, Max Tegmark)
Enhancing Automated Interpretability with Output-Centric Feature Descriptions (Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, Mor Geva)
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders (Luke Marks, Alasdair Paren, David Krueger, Fazl Barez)
BatchTopK Sparse Autoencoders (Bart Bussmann, Patrick Leask, Neel Nanda)
Internal states before wait modulate reasoning patterns (Dmitrii Troitskii, Koyena Pal, Chris Wendler, Callum Stuart McDougall, Neel Nanda)
Do Sparse Autoencoders Generalize? A Case Study of Answerability (Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost)
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs (Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang)
How Visual Representations Map to Language Feature Space in Multimodal LLMs (Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda)
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video (Sonia Joseph, Praneet Suresh, Lorenz Hufe, Edward Stevinson, Robert Graham, Yash Vadi, Danilo Bzdok, Sebastian Lapuschkin, Lee Sharkey, Blake Aaron Richards)
Interpreting the linear structure of vision-language model embedding spaces (Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, Stephanie Gil)
Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning (Stepan Shabalin, Ayush Panda, Dmitrii Kharlapenko, Abdur Raheem Ali, Yixiong Hao, Arthur Conmy)