AI deception evals

research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.

Theory of Change:proactively discover, evaluate, and understand the mechanisms of AI deception (e.g., alignment faking, manipulation, agentic deception) to prevent models from fooling human supervisors and causing harm.

Target Case:Worst Case

Orthodox Problems:

7.Superintelligence can fool human supervisors 8.Superintelligence can hack software supervisors

Estimated FTEs:30-80

Critiques:

A central criticism is that the evaluation scenarios are "artificial and contrived". the void and Lessons from a Chimp argue this research is "overattributing human traits" to models.

Outputs:

Liars' Bench: Evaluating Lie Detectors for Language Models— Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks

DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios— Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong, Xingxing Wei

Why Do Some Language Models Fake Alignment While Others Don't?— Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus, Fabien Roger

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions— John Hughes, Abhay Sheshadr

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models— Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas

Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL— Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine

Among Us: A Sandbox for Measuring and Detecting Agentic Deception— Satvik Golechha, Adrià Garriga-Alonso

Eliciting Secret Knowledge from Language Models— Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks

Edge Cases in AI Alignment— Florian Dietz

The MASK Evaluation

I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment— Aleksandr Kedrik, Igor Ivanov

Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects— Fred Heiding, Simon Lermen, Andrew Kao, Bruce Schneier, Arun Vishwanath

Mistral Large 2 (123B) seems to exhibit alignment faking— Marc Carauleanu, Diogo de Lucena, Gunnar Zarncke, Cameron Berg, Judd Rosenblatt, Mike Vaiana, Trent Hodgeson