Monitoring concepts
Identifies directions or subspaces in a model's latent state that correspond to high-level concepts (like refusal, deception, or planning) and uses them to audit models for misalignment, monitor them at runtime, suppress eval awareness, debug why models are failing, etc.
Theory of Change: By mapping internal activations to human-interpretable concepts, we can detect dangerous capabilities or deceptive alignment directly in the "mind" of the model, even when its overt behavior is perfectly safe. Computationally cheap monitors can then be deployed to flag hidden misalignment in deployed systems.
General Approach: Cognitive
Target Case: Pessimistic
Some names: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Tom Wollschläger, Anna Soligo, Jack Lindsey, Brian Christian, Ling Hu, Nicholas Goldowsky-Dill, Neel Nanda
Estimated FTEs: 50-100
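The core primitive behind most of the outputs below is a linear probe: find a direction in activation space that separates examples with and without a concept, then monitor new activations by projecting onto it. A minimal sketch with a difference-of-means direction, using synthetic activations in place of real model hidden states (the dimension, separation strength, and threshold are illustrative assumptions, not any paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # stand-in for a model's hidden dimension

# Synthetic data: "deceptive" activations are shifted along a hidden concept direction.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)
honest = rng.normal(size=(200, d))
deceptive = rng.normal(size=(200, d)) + 2.0 * concept

# Difference-of-means probe: the unit vector between class centroids.
direction = deceptive.mean(axis=0) - honest.mean(axis=0)
direction /= np.linalg.norm(direction)

def score(acts):
    # Monitoring is just a dot product per activation vector: cheap at runtime.
    return acts @ direction

# Flag anything past the midpoint between the two class means.
threshold = (score(honest).mean() + score(deceptive).mean()) / 2
flags = score(deceptive) > threshold
print(f"fraction of deceptive examples flagged: {flags.mean():.2f}")
```

In practice the probe is fit on activations cached from a specific layer, and the threshold is tuned for an acceptable false-positive rate on benign traffic; the works below differ mainly in how the direction is found and how robust it is across distributions.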
Outputs:
Convergent Linear Representations of Emergent Misalignment — Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda
Detecting Strategic Deception Using Linear Probes — Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn
Toward universal steering and monitoring of AI models — Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Mikhail Belkin
Reward Model Interpretability via Optimal and Pessimal Tokens — Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence — Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger
Cost-Effective Constitutional Classifiers via Representation Re-use — Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, Misha Wagner, Vladimir Mikulik, Mrinank Sharma
Refusal in LLMs is an Affine Function — Thomas Marshall, Adam Scherlis, Nora Belrose
White Box Control at UK AISI - Update on Sandbagging Investigations — Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
Here's 18 Applications of Deception Probes — Cleo Nardo, Avi Parrack, jordine
How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations — Brandon Jaipersaud, David Krueger, Ekdeep Singh Lubana
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models — James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez
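Beyond passive monitoring, several of the refusal-direction papers above use the same concept vectors for intervention: ablating the direction from activations removes the behavior it encodes. A sketch of directional ablation, again on synthetic vectors rather than real model states (the "refusal" direction and activations here are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
refusal = rng.normal(size=d)
refusal /= np.linalg.norm(refusal)  # unit concept direction

def ablate(acts, direction):
    # Remove each activation's component along the concept direction,
    # leaving everything orthogonal to it untouched.
    return acts - np.outer(acts @ direction, direction)

acts = rng.normal(size=(8, d)) + 3.0 * refusal
cleaned = ablate(acts, refusal)
print(np.abs(cleaned @ refusal).max())  # near zero: component removed
```

At runtime this projection would be applied to the residual stream at one or more layers on every forward pass, which is what makes these interventions (and the matching monitors) cheap relative to running a second model.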