Sparse Coding
Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic "features", each corresponding to an interpretable concept (a minimal sketch of this decomposition follows the critiques below).
Theory of Change: Get a principled decomposition of an LLM's activations into atomic components → identify deception and other misbehaviors.
Target Case: Average Case
Some names: Leo Gao, Dan Mossing, Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Thomas Heap, Abhinav Menon, Kenny Peng, Tim Lawson
Estimated FTEs: 50-100
Critiques:
Sparse Autoencoders Can Interpret Randomly Initialized Transformers
The Sparse Autoencoders bubble has popped, but they are still promising
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research
Sparse Autoencoders Trained on the Same Data Learn Different Features
Why Not Just Train For Interpretability?
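The workhorse here is the sparse autoencoder (SAE) or one of its relatives (transcoders, crosscoders, etc.). Below is a minimal PyTorch sketch of a TopK-style SAE, one common variant among the outputs listed further down; the layer sizes, hyperparameters, and training loop are illustrative placeholders rather than the setup of any particular paper.

```python
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    """Reconstruct residual-stream activations x (dimension d_model) as a
    sparse linear combination of n_features learned dictionary directions."""

    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        # Encode, subtracting the decoder bias first (a common SAE convention).
        pre = self.encoder(x - self.decoder.bias)
        # Keep only the k largest pre-activations per token: the sparse code.
        top = torch.topk(pre, self.k, dim=-1)
        codes = torch.zeros_like(pre).scatter_(-1, top.indices, torch.relu(top.values))
        # Decode: each active feature contributes one decoder column.
        return self.decoder(codes), codes

# Illustrative training step on random data standing in for cached activations;
# sparsity is enforced architecturally by TopK rather than by an L1 penalty.
sae = TopKSparseAutoencoder(d_model=768, n_features=16_384, k=32)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(1024, 768)
x_hat, codes = sae(x)
loss = (x_hat - x).pow(2).mean()
loss.backward()
opt.step()
print(f"MSE {loss.item():.4f}, mean L0 {(codes != 0).float().sum(-1).mean().item():.1f}")
```

In practice, x would be activations cached from a fixed residual-stream site of the model being studied, and the trained decoder columns are the candidate "features" one then tries to interpret and label.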
Outputs:
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning— Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models— Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda
Circuit Tracing: Revealing Computational Graphs in Language Models— Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models— Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders— Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
Sparse Autoencoders Do Not Find Canonical Units of Analysis— Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda
Transcoders Beat Sparse Autoencoders for Interpretability— Gonçalo Paulo, Stepan Shabalin, Nora Belrose
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization— Or Shafran, Atticus Geiger, Mor Geva
CRISP: Persistent Concept Unlearning via Sparse Autoencoders— Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov
The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs— Omar Mahmoud, Ali Khalil, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana
Scaling sparse feature circuit finding for in-context learning— Dmitrii Kharlapenko, Stepan Shabalin, Fazl Barez, Arthur Conmy, Neel Nanda
Learning Multi-Level Features with Matryoshka Sparse Autoencoders— Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing— Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda
Sparse Autoencoders Trained on the Same Data Learn Different Features— Gonçalo Paulo, Nora Belrose
What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data— Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson
Priors in Time: Missing Inductive Biases for Language Model Interpretability— Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J. Bigelow, Johnny Lin, Demba Ba, Martin Wattenberg, Fernanda Viegas, Melanie Weber, Aaron Mueller
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models— Patrick Leask, Neel Nanda, Noura Al Moubayed
Binary Sparse Coding for Interpretability— Lucia Quirke, Stepan Shabalin, Nora Belrose
Scaling Sparse Feature Circuit Finding to Gemma 9B— Diego Caples, Jatin Nainani, Callum McDougall, rrenaud
Partially Rewriting a Transformer in Natural Language— Gonçalo Paulo, Nora Belrose
Dense SAE Latents Are Features, Not Bugs— Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks— Adam Karvonen, Can Rager, Samuel Marks, Neel Nanda
Evaluating SAE interpretability without explanations— Gonçalo Paulo, Nora Belrose
SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs— Aashiq Muhamed, Jacopo Bonato, Mona Diab, Virginia Smith
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability— Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
SAEs Are Good for Steering -- If You Select the Right Features— Dana Arad, Aaron Mueller, Yonatan Belinkov
Line of Sight: On Linear Representations in VLLMs— Achyuta Rajaram, Sarah Schwettmann, Jacob Andreas, Arthur Conmy
Low-Rank Adapting Models for Sparse Autoencoders— Matthew Chen, Joshua Engels, Max Tegmark
Enhancing Automated Interpretability with Output-Centric Feature Descriptions— Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, Mor Geva
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models— Aashiq Muhamed, Mona Diab, Virginia Smith
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders— Luke Marks, Alasdair Paren, David Krueger, Fazl Barez
BatchTopK Sparse Autoencoders— Bart Bussmann, Patrick Leask, Neel Nanda
Towards Understanding Distilled Reasoning Models: A Representational Approach— David D. Baek, Max Tegmark
Understanding sparse autoencoder scaling in the presence of feature manifolds— Eric J. Michaud, Liv Gorton, Tom McGrath
Internal states before wait modulate reasoning patterns— Dmitrii Troitskii, Koyena Pal, Chris Wendler, Callum Stuart McDougall, Neel Nanda
Do Sparse Autoencoders Generalize? A Case Study of Answerability— Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs— Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang
How Visual Representations Map to Language Feature Space in Multimodal LLMs— Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video— Sonia Joseph, Praneet Suresh, Lorenz Hufe, Edward Stevinson, Robert Graham, Yash Vadi, Danilo Bzdok, Sebastian Lapuschkin, Lee Sharkey, Blake Aaron Richards
Topological Data Analysis and Mechanistic Interpretability— Gunnar Carlsson
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages— Jannik Brinkmann, Chris Wendler, Christian Bartelt, Aaron Mueller
Interpreting the linear structure of vision-language model embedding spaces— Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, Stephanie Gil
Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning— Stepan Shabalin, Ayush Panda, Dmitrii Kharlapenko, Abdur Raheem Ali, Yixiong Hao, Arthur Conmy