Shallow Review of Technical AI Safety, 2025

Model diffing

Understand what happens when a model is fine-tuned: what the "diff" between the fine-tuned model and the original consists of.
Theory of Change: By identifying the mechanistic differences between a base model and its fine-tune (e.g., after RLHF), maybe we can verify that safety behaviors are robustly "internalized" rather than superficially patched, and detect whether dangerous capabilities or deceptive alignment have been introduced, without needing to re-analyze the entire model. The diff is also much smaller than the model itself, since most parameters don't change, which means you can afford to apply heavier methods to it.
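A minimal sketch of the basic idea, in PyTorch with HuggingFace transformers. The checkpoint names, the prompt, and the per-layer norm metric are illustrative assumptions, not from any of the papers below; the point is just that the same input run through both checkpoints yields an activation "diff" you can analyze directly.

```python
# Sketch: compare activations of a base model and its fine-tune on one prompt.
# Checkpoints here are placeholders (any base / instruction-tuned pair works).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "google/gemma-2-2b"      # hypothetical base checkpoint
TUNED = "google/gemma-2-2b-it"  # hypothetical fine-tune of the same base

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, output_hidden_states=True)

ids = tok("Tell me about yourself.", return_tensors="pt")

with torch.no_grad():
    h_base = base(**ids).hidden_states    # tuple of [1, seq, d_model], one per layer
    h_tuned = tuned(**ids).hidden_states

# Mean activation-difference norm per layer: where the fine-tune "moved" the model.
for layer, (hb, ht) in enumerate(zip(h_base, h_tuned)):
    print(layer, (ht - hb).norm(dim=-1).mean().item())
```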
General Approach: Cognitive
Target Case: Pessimistic
Some names: Julian Minder, Clément Dumas, Neel Nanda, Trenton Bricken, Jack Lindsey
Estimated FTEs: 10-30
Outputs:
Open Source Replication of Anthropic's Crosscoder paper for model-diffing (Connor Kissane, robertzk, Arthur Conmy, Neel Nanda)
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning (Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda)
Persona Features Control Emergent Misalignment (Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing)
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences (Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, Neel Nanda)
Insights on Crosscoder Model Diffing (Siddharth Mishra-Sharma, Trenton Bricken, Jack Lindsey, Adam Jermyn, Jonathan Marcus, Kelley Rivoire, Christopher Olah, Thomas Henighan)
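Several of the outputs above build on crosscoders. A crosscoder, roughly, is a sparse autoencoder trained jointly on both models' activations: a shared sparse latent code with a separate decoder per model, so that latents whose decoder weights are large for only one model flag features the fine-tune added or removed. A minimal sketch of that recipe follows; the dimensions, loss weighting, and encoder arrangement are illustrative assumptions rather than the exact architecture from the papers above.

```python
# Sketch of a crosscoder: a shared sparse latent reconstructing the activations
# of both the base and the fine-tuned model. Hyperparameters are placeholders.
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        # Per-model encoders sum into one shared latent; per-model decoders.
        self.enc_base = nn.Linear(d_model, d_latent)
        self.enc_tuned = nn.Linear(d_model, d_latent)
        self.dec_base = nn.Linear(d_latent, d_model, bias=False)
        self.dec_tuned = nn.Linear(d_latent, d_model, bias=False)

    def forward(self, a_base, a_tuned):
        # One sparse code z must reconstruct both models' activations.
        z = torch.relu(self.enc_base(a_base) + self.enc_tuned(a_tuned))
        return z, self.dec_base(z), self.dec_tuned(z)

def crosscoder_loss(model, a_base, a_tuned, l1_coef=1e-3):
    z, rec_b, rec_t = model(a_base, a_tuned)
    recon = (rec_b - a_base).pow(2).mean() + (rec_t - a_tuned).pow(2).mean()
    # L1 penalty on z encourages sparse, interpretable latents.
    return recon + l1_coef * z.abs().sum(dim=-1).mean()

# After training, compare per-latent decoder column norms: latents where
# ||dec_tuned|| is large but ||dec_base|| is near zero are candidate
# features introduced by the fine-tune (and vice versa for removed ones).
```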