Monitoring concepts
Identifies directions or subspaces in a model's latent state that correspond to high-level concepts (like refusal, deception, or planning) and uses them to audit models for misalignment, monitor them at runtime, suppress eval awareness, debug why models are failing, etc.
Theory of Change: By mapping internal activations to human-interpretable concepts, we can detect dangerous capabilities or deceptive alignment directly in the "mind" of the model, even when its overt behavior is perfectly safe. Computationally cheap monitors can then be deployed to flag hidden misalignment in deployed systems.
General Approach: Cognitive
Target Case: Pessimistic
Some names: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Tom Wollschläger, Anna Soligo, Jack Lindsey, Brian Christian, Ling Hu, Nicholas Goldowsky-Dill, Neel Nanda
Estimated FTEs: 50-100
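The core primitive behind most of the outputs below is a linear probe: find a direction in activation space that separates examples with and without a concept, then monitor new activations by projecting onto it. A minimal sketch with a difference-of-means direction, using synthetic activations in place of real model hidden states (the dimension, separation strength, and threshold are illustrative assumptions, not any paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # stand-in for a model's hidden dimension

# Synthetic data: "deceptive" activations are shifted along a hidden concept direction.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)
honest = rng.normal(size=(200, d))
deceptive = rng.normal(size=(200, d)) + 2.0 * concept

# Difference-of-means probe: the unit vector between class centroids.
direction = deceptive.mean(axis=0) - honest.mean(axis=0)
direction /= np.linalg.norm(direction)

def score(acts):
    # Monitoring is just a dot product per activation vector: cheap at runtime.
    return acts @ direction

# Flag anything past the midpoint between the two class means.
threshold = (score(honest).mean() + score(deceptive).mean()) / 2
flags = score(deceptive) > threshold
print(f"fraction of deceptive examples flagged: {flags.mean():.2f}")
```

In practice the probe is fit on activations cached from a specific layer, and the threshold is tuned for an acceptable false-positive rate on benign traffic; the works below differ mainly in how the direction is found and how robust it is across distributions.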
Outputs:
Convergent Linear Representations of Emergent Misalignment — Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda
Detecting Strategic Deception Using Linear Probes — Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn
Toward universal steering and monitoring of AI models — Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Mikhail Belkin
Reward Model Interpretability via Optimal and Pessimal Tokens — Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence — Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger
Cost-Effective Constitutional Classifiers via Representation Re-use — Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, Misha Wagner, Vladimir Mikulik, Mrinank Sharma
Refusal in LLMs is an Affine Function — Thomas Marshall, Adam Scherlis, Nora Belrose
White Box Control at UK AISI - Update on Sandbagging Investigations — Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
Here's 18 Applications of Deception Probes — Cleo Nardo, Avi Parrack, jordine
How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations — Brandon Jaipersaud, David Krueger, Ekdeep Singh Lubana
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models — James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez
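Beyond passive monitoring, several of the refusal-direction papers above use the same concept vectors for intervention: ablating the direction from activations removes the behavior it encodes. A sketch of directional ablation, again on synthetic vectors rather than real model states (the "refusal" direction and activations here are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
refusal = rng.normal(size=d)
refusal /= np.linalg.norm(refusal)  # unit concept direction

def ablate(acts, direction):
    # Remove each activation's component along the concept direction,
    # leaving everything orthogonal to it untouched.
    return acts - np.outer(acts @ direction, direction)

acts = rng.normal(size=(8, d)) + 3.0 * refusal
cleaned = ablate(acts, refusal)
print(np.abs(cleaned @ refusal).max())  # near zero: component removed
```

At runtime this projection would be applied to the residual stream at one or more layers on every forward pass, which is what makes these interventions (and the matching monitors) cheap relative to running a second model.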