Shallow Review of Technical AI Safety, 2025

Inoculation prompting

Prompt mild misbehaviour in training, to prevent the failure mode where once AI misbehaves in a mild way, it will be more inclined towards all bad behaviour.
Theory of Change:LLMs don't seem very dangerous and might scale to AGI, things are generally smooth, relevant capabilities are harder than alignment, assume no mesaoptimisers, assume that zero-shot deception is hard, assume a fundamentally humanish ontology is learned, assume no simulated agents, assume that noise in the data means that human preferences are not ruled out, assume that alignment is a superficial feature, assume that tuning for what we want will also get us to avoid what we don't want. Maybe assume that thoughts are translucent.
General Approach:Engineering
Target Case:Average Case
Some names:Alex Turner, Alex Cloud
Outputs:
Recontextualization Mitigates Specification Gaming Without Modifying the SpecificationAriana Azarbal, Victor Gillioz, Alexander Matt Turner, Alex Cloud
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-timeDaniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignmentNevan Wichers, Aram Ebtekar, Ariana Azarbal, Victor Gillioz, Christine Ye, Emil Ryd, Neil Rathi, Henry Sleight, Alex Mallen, Fabien Roger, Samuel Marks