Situational awareness and self-awareness evals
Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.
Theory of Change:If an AI can distinguish between evaluation and deployment ("evaluation awareness"), it might hide dangerous capabilities (scheming/sandbagging). By measuring self- and situational-awareness, we can better assess this risk and build more robust evaluations.
General Approach:Behavioral
Target Case:Worst Case
See Also:
Some names:Jan Betley
Estimated FTEs:30-70
Outputs:
AI Awareness— Xiaojian Li, Haoyuan Shi, Rongwu Xu, Wei Xu
Tell me about yourself: LLMs are aware of their learned behaviors— Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans
Evaluating Frontier Models for Stealth and Situational Awareness— Mary Phuong, Roland S. Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah
Large Language Models Often Know When They Are Being Evaluated— Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn
Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings— Casey Barkan, Sid Black, Oliver Sourbut
Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness— Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, Kevin Zhu
Claude Sonnet 3.7 (often) knows when it's in alignment evaluations— Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn
It's hard to make scheming evals look realistic for LLMs— Igor Ivanov, Danil Kadochnikov
Know Thyself? On the Incapability and Implications of AI Self-Recognition— Xiaoyan Bai, Aryan Shrivastava, Ari Holtzman, Chenhao Tan
Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs— Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland