AI deception evals
research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.
Theory of Change:proactively discover, evaluate, and understand the mechanisms of AI deception (e.g., alignment faking, manipulation, agentic deception) to prevent models from fooling human supervisors and causing harm.
Target Case:Worst Case
Estimated FTEs:30-80
Critiques:
A central criticism is that the evaluation scenarios are "artificial and contrived". the void and Lessons from a Chimp argue this research is "overattributing human traits" to models.
Outputs:
Liars' Bench: Evaluating Lie Detectors for Language Models— Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks
DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios— Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong, Xingxing Wei
Why Do Some Language Models Fake Alignment While Others Don't?— Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus, Fabien Roger
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions— John Hughes, Abhay Sheshadr
D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models— Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas
Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL— Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine
Among Us: A Sandbox for Measuring and Detecting Agentic Deception— Satvik Golechha, Adrià Garriga-Alonso
Eliciting Secret Knowledge from Language Models— Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks
Edge Cases in AI Alignment— Florian Dietz
I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment— Aleksandr Kedrik, Igor Ivanov
Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects— Fred Heiding, Simon Lermen, Andrew Kao, Bruce Schneier, Arun Vishwanath
Mistral Large 2 (123B) seems to exhibit alignment faking— Marc Carauleanu, Diogo de Lucena, Gunnar Zarncke, Cameron Berg, Judd Rosenblatt, Mike Vaiana, Trent Hodgeson