Emergent misalignment
Fine-tuning LLMs on one narrow antisocial task can cause general misalignment, including deception, shutdown resistance, harmful advice, and extremist sympathies, even though those behaviors are never trained or rewarded directly. This is a new agenda that quickly led to a stream of exciting work.
Theory of Change: Predict, detect, and prevent models from developing broadly harmful behaviors (like deception or shutdown resistance) when fine-tuned on seemingly unrelated tasks. Find, preserve, and robustify the correlated representation of the good that this broad generalization implies.
General Approach: Behavioral
Target Case: Pessimistic
See Also: auditing real models, Pragmatic interpretability
Some names: Truthful AI, Jan Betley, James Chua, Mia Taylor, Miles Wang, Edward Turner, Anna Soligo, Alex Cloud, Nathan Hu, Owain Evans
Estimated FTEs: 10-50
Outputs:
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs - Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models - James Chua, Jan Betley, Mia Taylor, Owain Evans
Persona Features Control Emergent Misalignment - Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing
Model Organisms for Emergent Misalignment - Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda
School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs - Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans
Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data - Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans
Convergent Linear Representations of Emergent Misalignment - Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda
Narrow Misalignment is Hard, Emergent Misalignment is Easy - Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda
Aesthetic Preferences Can Cause Emergent Misalignment - Anders Woodruff
Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences - Batu El, James Zou
Emergent Misalignment & Realignment - Elizaveta Tennant, Jasper Timm, Kevin Wei, David Quarel
Selective Generalization: Improving Capabilities While Maintaining Alignment - Ariana Azarbal, Matthew A. Clarke, Jorio Cocola, Cailley Factor, Alex Cloud
Emergent Misalignment on a Budget - Valerio Pepe, Armaan Tipirneni
The Rise of Parasitic AI - Adele Lopez
Open problems in emergent misalignment - Jan Betley, Daniel Tan