Emergent misalignment

Fine-tuning LLMs on one narrow antisocial task can cause general misalignment, including deception, shutdown resistance, harmful advice, and extremist sympathies, even though those behaviors are never trained or rewarded directly. This new agenda quickly led to a stream of exciting work.

Theory of Change: Predict, detect, and prevent the development of broadly harmful behaviors (like deception or shutdown resistance) in models fine-tuned on seemingly unrelated tasks. The same generalization suggests that models hold a correlated representation of the good; find, preserve, and robustify it.

General Approach: Behavioral

Target Case: Pessimistic

Orthodox Problems:

4. Goals misgeneralize out of distribution

7. Superintelligence can fool human supervisors

See Also:

auditing real models, Pragmatic interpretability

Some names: Truthful AI, Jan Betley, James Chua, Mia Taylor, Miles Wang, Edward Turner, Anna Soligo, Alex Cloud, Nathan Hu, Owain Evans

Estimated FTEs: 10-50

Critiques:

Emergent Misalignment as Prompt Sensitivity

Go home GPT-4o, you're drunk

Outputs:

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs - Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models - James Chua, Jan Betley, Mia Taylor, Owain Evans

Persona Features Control Emergent Misalignment - Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing

Model Organisms for Emergent Misalignment - Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs - Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans

Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data - Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans

Convergent Linear Representations of Emergent Misalignment - Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda

Narrow Misalignment is Hard, Emergent Misalignment is Easy - Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda

Aesthetic Preferences Can Cause Emergent Misalignment - Anders Woodruff

Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences - Batu El, James Zou

Emergent Misalignment & Realignment - Elizaveta Tennant, Jasper Timm, Kevin Wei, David Quarel

Realistic Reward Hacking Induces Different and Deeper Misalignment - Jozdien

Selective Generalization: Improving Capabilities While Maintaining Alignment - Ariana Azarbal, Matthew A. Clarke, Jorio Cocola, Cailley Factor, Alex Cloud

Emergent Misalignment on a Budget - Valerio Pepe, Armaan Tipirneni

The Rise of Parasitic AI - Adele Lopez

LLM AGI may reason about its goals and discover misalignments by default - Seth Herd

Open problems in emergent misalignment - Jan Betley, Daniel Tan