Shallow Review of Technical AI Safety, 2025

Situational awareness and self-awareness evals

Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.
Theory of Change:If an AI can distinguish between evaluation and deployment ("evaluation awareness"), it might hide dangerous capabilities (scheming/sandbagging). By measuring self- and situational-awareness, we can better assess this risk and build more robust evaluations.
General Approach:Behavioral
Target Case:Worst Case
Some names:Jan Betley
Estimated FTEs:30-70
Outputs:
AI AwarenessXiaojian Li, Haoyuan Shi, Rongwu Xu, Wei Xu
Tell me about yourself: LLMs are aware of their learned behaviorsJan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans
Evaluating Frontier Models for Stealth and Situational AwarenessMary Phuong, Roland S. Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah
Large Language Models Often Know When They Are Being EvaluatedJoe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn
Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation AwarenessLang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, Kevin Zhu
Claude Sonnet 3.7 (often) knows when it's in alignment evaluationsNicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn
Know Thyself? On the Incapability and Implications of AI Self-RecognitionXiaoyan Bai, Aryan Shrivastava, Ari Holtzman, Chenhao Tan
Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMsSara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland