Character training and persona steering

Map, shape, and control the personas of language models, so that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behaviors).

Theory of Change: If post-training, prompting, and activation engineering all interact with some kind of structured 'persona space', then understanding that space better should improve the design, control, and detection of LLM personas.

General Approach: Cognitive

Target Case: Average Case

Orthodox Problems:

1. Value is fragile and hard to specify

Some names: Truthful AI, Amanda Askell, Jack Lindsey, Theia Vogel, Sharan Maiya, Evan Hubinger

Critiques:

Nostalgebraist

Outputs:

Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI— Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger

On the functional self of LLMs— eggsyntax

Claude 4.5 Opus' Soul Document— Richard Weiss

Persona Features Control Emergent Misalignment— Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time— Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor

Persona Vectors: Monitoring and Controlling Character Traits in Language Models— Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey

Reducing LLM deception at scale with self-other overlap fine-tuning— Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Cameron Berg, Mike Vaiana, Trent Hodgeson

The Rise of Parasitic AI— Adele Lopez

A Three-Layer Model of LLM Psychology— Jan_Kulveit

Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models— Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger

Selection Pressures on LM Personas— Raymond Douglas

the void— nostalgebraist

void miscellany— nostalgebraist