Shallow Review of Technical AI Safety, 2025

Emergent misalignment

Fine-tuning LLMs on one narrow antisocial task can cause general misalignment, including deception, shutdown resistance, harmful advice, and extremist sympathies, even though those behaviors are never trained or rewarded directly. A new agenda that quickly led to a stream of exciting work.
Theory of Change: Predict, detect, and prevent models from developing broadly harmful behaviors (like deception or shutdown resistance) when fine-tuned on seemingly unrelated tasks. Find, preserve, and robustify this correlated representation of the good.
General Approach: Behavioral
Target Case: Pessimistic
See Also: auditing real models, Pragmatic interpretability
Some names: Truthful AI, Jan Betley, James Chua, Mia Taylor, Miles Wang, Edward Turner, Anna Soligo, Alex Cloud, Nathan Hu, Owain Evans
Estimated FTEs: 10-50
Outputs:
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans)
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models (James Chua, Jan Betley, Mia Taylor, Owain Evans)
Persona Features Control Emergent Misalignment (Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing)
Model Organisms for Emergent Misalignment (Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda)
School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs (Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans)
Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data (Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans)
Convergent Linear Representations of Emergent Misalignment (Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda)
Narrow Misalignment is Hard, Emergent Misalignment is Easy (Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda)
Emergent Misalignment & Realignment (Elizaveta Tennant, Jasper Timm, Kevin Wei, David Quarel)
Selective Generalization: Improving Capabilities While Maintaining Alignment (Ariana Azarbal, Matthew A. Clarke, Jorio Cocola, Cailley Factor, Alex Cloud)
Emergent Misalignment on a Budget (Valerio Pepe, Armaan Tipirneni)
Open problems in emergent misalignment (Jan Betley, Daniel Tan)