Model diffing
Understand what happens when a model is finetuned, i.e., what the "diff" between the finetuned and the original model consists of.
Theory of Change: By identifying the mechanistic differences between a base model and its fine-tune (e.g., after RLHF), maybe we can verify that safety behaviors are robustly "internalized" rather than superficially patched, and detect whether dangerous capabilities or deceptive alignment have been introduced, without needing to re-analyze the entire model. The diff is also much smaller than the full model, since most parameters barely change, which means you can afford to apply heavier methods to it.
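As a minimal illustration of that last point, here is a sketch that loads a base checkpoint and its fine-tune and measures the relative per-tensor weight change. The checkpoint names are placeholders, and it assumes both checkpoints share an architecture and parameter ordering:

```python
# Minimal sketch: how concentrated is the weight diff between a base model
# and its fine-tune? Checkpoint names are placeholders, not real models.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("org/base-model")
tuned = AutoModelForCausalLM.from_pretrained("org/base-model-chat")

# Relative per-tensor change; for many fine-tunes most tensors barely move,
# which is what makes the diff a cheap object to study with heavier methods.
with torch.no_grad():
    for (name, pb), (_, pt) in zip(base.named_parameters(), tuned.named_parameters()):
        rel = (pt - pb).norm() / (pb.norm() + 1e-8)
        print(f"{name:60s} {rel.item():.4f}")
```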
General Approach: Cognitive
Target Case: Pessimistic
Orthodox Problems:
See Also:
Some names: Julian Minder, Clément Dumas, Neel Nanda, Trenton Bricken, Jack Lindsey
Estimated FTEs: 10-30
Outputs:
What We Learned Trying to Diff Base and Chat Models (And Why It Matters) — Clément Dumas, Julian Minder, Neel Nanda
Open Source Replication of Anthropic's Crosscoder paper for model-diffing — Connor Kissane, robertzk, Arthur Conmy, Neel Nanda
Discovering Undesired Rare Behaviors via Model Diff Amplification — Santiago Aranguri, Thomas McGrath (the sampling idea is sketched after this list)
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning — Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda
Persona Features Control Emergent Misalignment — Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences — Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, Neel Nanda
Insights on Crosscoder Model Diffing — Siddharth Mishra-Sharma, Trenton Bricken, Jack Lindsey, Adam Jermyn, Jonathan Marcus, Kelley Rivoire, Christopher Olah, Thomas Henighan
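The Aranguri & McGrath output above turns the diff into a sampling method: generate from logits that exaggerate the fine-tune's shift away from the base model, so rare fine-tuning-induced behaviors appear often enough to study. A minimal sketch of that amplified-sampling idea, assuming the amplified logits take the form l_tuned + α(l_tuned − l_base); model names and α are placeholders, and the formula is our reading of the idea rather than their code:

```python
# Sketch of model-diff-amplified sampling: steer generation with logits that
# exaggerate the fine-tune's shift away from the base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("org/base-model")
base = AutoModelForCausalLM.from_pretrained("org/base-model").eval()
tuned = AutoModelForCausalLM.from_pretrained("org/base-model-chat").eval()

alpha = 2.0  # amplification strength (hypothetical value); alpha = 0 is ordinary sampling
ids = tok("The user asked a question.", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(30):  # generate 30 tokens
        l_base = base(ids).logits[:, -1]
        l_tuned = tuned(ids).logits[:, -1]
        # Amplified logits: the tuned model's logits plus alpha times the
        # (tuned - base) difference, so fine-tuning-induced changes dominate
        # and rare tuned-only behaviors become frequent enough to inspect.
        l_amp = l_tuned + alpha * (l_tuned - l_base)
        next_id = torch.multinomial(torch.softmax(l_amp, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0], skip_special_tokens=True))
```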