Shallow Review of Technical AI Safety, 2025

Monitoring concepts

Identifies directions or subspaces in a model's latent state that correspond to high-level concepts (such as refusal, deception, or planning) and uses them to audit models for misalignment, monitor them at runtime, suppress eval awareness, debug why models are failing, and so on (a minimal sketch of the direction-extraction step follows the metadata below).
Theory of Change: By mapping internal activations to human-interpretable concepts, we can detect dangerous capabilities or deceptive alignment directly in the model's internals, even when its overt behavior is perfectly safe. Computationally cheap monitors can then be deployed to flag hidden misalignment in deployed systems (see the probe sketch after the output list).
General Approach: Cognitive
Target Case: Pessimistic
Some names: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Tom Wollschläger, Anna Soligo, Jack Lindsey, Brian Christian, Ling Hu, Nicholas Goldowsky-Dill, Neel Nanda
Estimated FTEs: 50-100
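As a concrete illustration of the approach described above, here is a minimal sketch of extracting a concept direction by difference of class means and using it both as a scalar monitor and for directional ablation. The activations, layer choice, and dimensions are synthetic stand-ins for illustration, not any particular model's internals or any specific paper's method.

```python
# Minimal sketch: extract a "concept direction" from residual-stream activations
# via difference of class means, then use it for monitoring and ablation.
# Activations here are synthetic stand-ins; in practice they would be cached
# hidden states from a chosen layer of the model under study.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Synthetic activations: one set from prompts exhibiting the concept
# (e.g. refusal), one from contrast prompts that do not. A signal is
# planted along the first basis direction purely so the demo has structure.
concept_acts = rng.normal(0.0, 1.0, size=(200, d_model)) + 2.0 * np.eye(d_model)[0]
neutral_acts = rng.normal(0.0, 1.0, size=(200, d_model))

# Difference-of-means direction, normalised to unit length.
direction = concept_acts.mean(axis=0) - neutral_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def concept_score(activation: np.ndarray) -> float:
    """Scalar monitor: projection of an activation onto the concept direction."""
    return float(activation @ direction)

def ablate_concept(activation: np.ndarray) -> np.ndarray:
    """Remove the component along the concept direction (directional ablation)."""
    return activation - (activation @ direction) * direction

test_act = concept_acts[0]
print(f"score before ablation: {concept_score(test_act):.2f}")
print(f"score after ablation:  {concept_score(ablate_concept(test_act)):.2f}")
```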
Outputs:
Convergent Linear Representations of Emergent Misalignment (Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda)
Detecting Strategic Deception Using Linear Probes (Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn)
Toward universal steering and monitoring of AI models (Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Mikhail Belkin)
Reward Model Interpretability via Optimal and Pessimal Tokens (Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska)
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence (Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger)
Cost-Effective Constitutional Classifiers via Representation Re-use (Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, Misha Wagner, Vladimir Mikulik, Mrinank Sharma)
Refusal in LLMs is an Affine Function (Thomas Marshall, Adam Scherlis, Nora Belrose)
White Box Control at UK AISI - Update on Sandbagging Investigations (Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney)
Here's 18 Applications of Deception Probes (Cleo Nardo, Avi Parrack, jordine)
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models (James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez)
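To ground the theory of change above, here is a minimal sketch of a computationally cheap runtime monitor: a logistic-regression probe on layer activations whose per-token inference cost is roughly a dot product plus a sigmoid. The training data, labels, and flagging threshold are synthetic assumptions for illustration, not a reproduction of any listed paper's setup.

```python
# Minimal sketch: a cheap runtime monitor built from a linear probe on
# layer activations. Labels and activations are synthetic stand-ins; in
# practice labels would come from a dataset of honest vs. deceptive rollouts,
# and the probe would be evaluated on held-out data rather than its training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d_model, n_train = 512, 400

# Synthetic training data: activations plus binary "deceptive" labels.
X = rng.normal(size=(n_train, d_model))
y = rng.integers(0, 2, size=n_train)
X[y == 1] += 0.5  # plant a weak linear signal for the positive class

probe = LogisticRegression(max_iter=1000).fit(X, y)

def monitor(activation: np.ndarray, threshold: float = 0.9) -> bool:
    """Flag an activation if the probe's deception probability exceeds the threshold."""
    p = probe.predict_proba(activation.reshape(1, -1))[0, 1]
    return p > threshold

flags = [monitor(x) for x in X[:20]]
print(f"flagged {sum(flags)} of 20 sampled activations")
```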