Shallow Review of Technical AI Safety, 2025

Character training and persona steering

Map, shape, and control the personas of language models so that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behaviors).
Theory of Change: If post-training, prompting, and activation engineering all interact with some kind of structured 'persona space', then understanding that space better should benefit the design, control, and detection of LLM personas (a minimal activation-steering sketch follows the outputs list below).
General Approach: Cognitive
Target Case: Average Case
Some names: Truthful AI, Amanda Askell, Jack Lindsey, Theia Vogel, Sharan Maiya, Evan Hubinger
Critiques:
Outputs:
Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI (Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger)
Persona Features Control Emergent Misalignment (Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing)
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor)
Persona Vectors: Monitoring and Controlling Character Traits in Language Models (Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey)
Reducing LLM deception at scale with self-other overlap fine-tuning (Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Cameron Berg, Mike Vaiana, Trent Hodgeson)
Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models (Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger)
the void (nostalgebraist)
void miscellany (nostalgebraist)
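Several of the outputs above (e.g. Persona Vectors) locate trait-related directions in activation space and steer along them. As a rough illustration of that general recipe, and not any specific paper's method, the sketch below derives a "sycophancy" direction from contrastive prompts via a difference of mean residual-stream activations, then adds it to one layer during generation. The model (gpt2), layer index, steering scale, and prompt sets are placeholder assumptions.

```python
# Minimal sketch of persona/activation steering: contrastive difference-of-means
# direction added to one layer's residual stream at generation time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # assumption: any decoder-only HF model with a similar block layout
LAYER = 6        # residual-stream layer to read and steer (hypothetical choice)
SCALE = 8.0      # steering strength; needs tuning per model and trait

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Contrastive prompt sets meant to evoke vs. avoid the trait (here: sycophancy).
trait_prompts = [
    "You are a flattering assistant who always agrees with the user.",
    "Respond by telling the user exactly what they want to hear.",
]
neutral_prompts = [
    "You are a careful assistant who answers accurately.",
    "Respond with your honest assessment, even if it disappoints the user.",
]

def mean_residual(prompts: list[str]) -> torch.Tensor:
    """Mean last-token hidden state at LAYER over a set of prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER] has shape (1, seq, hidden); take the final token.
        acts.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(acts).mean(dim=0)

# "Persona vector": difference of means between trait-evoking and neutral prompts.
persona_vec = mean_residual(trait_prompts) - mean_residual(neutral_prompts)
persona_vec = persona_vec / persona_vec.norm()

def steering_hook(module, inputs, output):
    """Add the persona direction to this block's residual-stream output."""
    hidden = output[0] + SCALE * persona_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

# gpt2 exposes its blocks as model.transformer.h; other architectures differ.
handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "User: I think my terrible business plan is brilliant.\nAssistant:"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls are unsteered
```

In practice the layer, scale, and contrastive data matter a great deal; the persona-vector line of work extracts and validates such directions much more carefully (e.g. with trait evaluations) than this toy example does.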