Situational awareness and self-awareness evals

Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.

Theory of Change:If an AI can distinguish between evaluation and deployment ("evaluation awareness"), it might hide dangerous capabilities (scheming/sandbagging). By measuring self- and situational-awareness, we can better assess this risk and build more robust evaluations.

General Approach:Behavioral

Target Case:Worst Case

Orthodox Problems:

7.Superintelligence can fool human supervisors 8.Superintelligence can hack software supervisors

Some names:Jan Betley

Estimated FTEs:30-70

Critiques:

Lessons from a Chimp: AI "Scheming" and the Quest for Ape Language, It's hard to make scheming evals look realistic for LLMs

Outputs:

AI Awareness— Xiaojian Li, Haoyuan Shi, Rongwu Xu, Wei Xu

Tell me about yourself: LLMs are aware of their learned behaviors— Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans

Evaluating Frontier Models for Stealth and Situational Awareness— Mary Phuong, Roland S. Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah

Large Language Models Often Know When They Are Being Evaluated— Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn

Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings— Casey Barkan, Sid Black, Oliver Sourbut

Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness— Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, Kevin Zhu

Claude Sonnet 3.7 (often) knows when it's in alignment evaluations— Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn

It's hard to make scheming evals look realistic for LLMs— Igor Ivanov, Danil Kadochnikov

Know Thyself? On the Incapability and Implications of AI Self-Recognition— Xiaoyan Bai, Aryan Shrivastava, Ari Holtzman, Chenhao Tan

Chain-of-Thought Snippets — Anti-Scheming

Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs— Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland