Character training and persona steering
Map, shape, and control the personae of language models, so that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behaviors).
Theory of Change: If post-training, prompting, and activation engineering all interact with some kind of structured 'persona space', then understanding that space better should benefit the design, control, and detection of LLM personas.
General Approach: Cognitive
Target Case: Average Case
Orthodox Problems:
See Also:
Some names: Truthful AI, Amanda Askell, Jack Lindsey, Theia Vogel, Sharan Maiya, Evan Hubinger
Critiques:
Outputs:
Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI— Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger
On the functional self of LLMs— eggsyntax
Claude 4.5 Opus' Soul Document— Richard Weiss
Persona Features Control Emergent Misalignment— Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time— Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor
Persona Vectors: Monitoring and Controlling Character Traits in Language Models— Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey
Reducing LLM deception at scale with self-other overlap fine-tuning— Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Cameron Berg, Mike Vaiana, Trent Hodgeson
The Rise of Parasitic AI— Adele Lopez
A Three-Layer Model of LLM Psychology— Jan_Kulveit
Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models— Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger
Selection Pressures on LM Personas— Raymond Douglas
the void— nostalgebraist
void miscellany— nostalgebraist