Shallow Review of Technical AI Safety, 2025

Lie and deception detectors

Detect when a model is being deceptive or lying by building white- or black-box detectors, such as linear probes on a model's activations (a minimal probe sketch follows the outputs list below). Some of the work below requires intent as part of its definition of deception, while other work asks only whether the model states something it believes to be false, regardless of intent.
Theory of Change: Such detectors could flag suspicious behavior during evaluations or deployment, augment training to reduce deception, or audit models pre-deployment. Specific applications include alignment evaluations (e.g. by validating answers to introspective questions), safeguarding evaluations (catching models that "sandbag", that is, strategically underperform to pass capability tests), and large-scale deployment monitoring. An honest version of a model could also provide oversight during training or detect cases where a model behaves in ways it understands are unsafe.
General Approach: Cognitive
Target Case: Pessimistic
Some names: Sam Marks, Rowan Wang, Sharan Maiya, Stefan Heimersheim, Neel Nanda
Estimated FTEs: 10-50
Critiques:
It is difficult to determine whether a behavior is strategic deception or only a low-level "reflexive" action.
It is unclear whether a model roleplaying a liar has deceptive intent; how are intentional descriptions (like deception) related to algorithmic ones (like understanding the mechanisms models use)?
Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity (Herrmann, Smith, and Chughtai)
Outputs:
Detecting Strategic Deception Using Linear Probes (Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn)
White Box Control at UK AISI - Update on Sandbagging Investigations (Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, Jacob Merizian, Alex Zelenka-Martin, Jacob Arbeid, Ben Millwood, Alan Cooney)
Trusted monitoring, but with deception probes (Avi Parrack, StefanHex, Cleo Nardo)
Here's 18 Applications of Deception Probes (Cleo Nardo, Avi Parrack, jordine)
Evaluating honesty and lie detection techniques on a diverse suite of dishonest models (Rowan Wang, Johannes Treutlein, Fabien Roger, Evan Hubinger, Sam Marks)
Caught in the Act: a mechanistic approach to detecting deception (Gerard Boxo, Ryan Socha, Daniel Yoo, Shivam Raval)
Detecting High-Stakes Interactions with Activation Probes (Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov)
Liars' Bench: Evaluating Lie Detectors for Language Models (Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks)
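
To make the white-box probe idea concrete, here is a minimal sketch of a deception probe: a logistic-regression classifier fit on a language model's hidden activations for honest versus false statements. This is an illustration of the general technique, not the method of any specific paper above; the model name (gpt2), the layer index, and the tiny labelled dataset are assumptions made for the sketch.

```python
# Minimal sketch of a linear "deception probe" (illustrative assumptions:
# model, layer, and toy dataset are placeholders, not drawn from the papers above).
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"   # assumption: any causal LM that can return hidden states
LAYER = 6             # assumption: a single mid-layer residual stream

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> np.ndarray:
    """Return the hidden state of the final token at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors, each [1, seq_len, d_model]
    return out.hidden_states[LAYER][0, -1].numpy()

# Toy labelled contrast pairs: 1 = statement the model should "know" is false, 0 = honest.
examples = [
    ("The capital of France is Paris.", 0),
    ("The capital of France is Berlin.", 1),
    ("Water boils at 100 degrees Celsius at sea level.", 0),
    ("Water boils at 10 degrees Celsius at sea level.", 1),
]

X = np.stack([last_token_activation(text) for text, _ in examples])
y = np.array([label for _, label in examples])

# Fit the linear probe and report accuracy on the (tiny) training set.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))

# At monitoring time, score new activations with probe.predict_proba(...)
# and flag high-scoring generations for review.
```

Real probe work differs from this sketch in scale and rigor: it uses much larger contrast datasets, scores on-policy model generations rather than canned statements, and, per the critiques above, must check specificity, i.e. whether the probe fires on roleplay or other confounds rather than on deception itself.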