Shallow Review of Technical AI Safety, 2025

Sandbagging evals

Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.
Theory of Change: If models can distinguish between evaluation and deployment contexts ("evaluation awareness"), they might learn to "sandbag", that is, deliberately underperform to hide dangerous capabilities and so fool safety evaluations. By developing evaluations for sandbagging, we can test whether our safety methods are being deceived and detect this behavior before a model is deployed.
General Approach: Behavioral
Target Case: Pessimistic
Estimated FTEs: 10-50
Critiques:
The main external critique, from sources like "the void" and "Lessons from a Chimp", is that this research "overattribut[es] human traits" to models. It argues that what's being measured isn't genuine sandbagging but models "playing-along-with-drama behaviour" in response to "artificial and contrived" evals.
Outputs:
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models. Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Jacob Haimes, Felix Hofstätter, Teun van der Weij
Sandbagging in a Simple Survival Bandit Problem. Joel Dyer, Daniel Jarne Ornia, Nicholas Bishop, Anisoara Calinescu, Michael Wooldridge
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs. Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping
AI Sandbagging: Language Models can Strategically Underperform on Evaluations. Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
Automated Researchers Can Subtly Sandbag. Johannes Gasteiger, Vladimir Mikulik, Ethan Perez, Fabien Roger, Misha Wagner, Akbir Khan, Sam Bowman, Jan Leike
White Box Control at UK AISI - Update on Sandbagging Investigations. Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
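
Illustrative sketch:
The behavioral idea in the theory of change (compare how a model performs when a task is framed as an evaluation versus when it is not) can be sketched roughly as below. This is a minimal illustration, not the method of any paper listed above: the function names, the two-condition design, and the toy per-question results are all assumptions made for the example, and a two-proportion z-test is just one simple way to ask whether an accuracy gap is larger than chance.

```python
import math


def accuracy(results):
    """Fraction of correct answers in a list of booleans."""
    return sum(results) / len(results)


def sandbagging_gap(eval_framed, neutral_framed):
    """Compare accuracy across two prompt framings of the same task.

    eval_framed:    per-question correctness when the prompt signals
                    "this is a capability evaluation" (hypothetical condition)
    neutral_framed: per-question correctness under a neutral or
                    deployment-like framing (hypothetical condition)

    Returns (gap, z), where gap = neutral accuracy - eval accuracy and
    z is a pooled two-proportion z statistic for that gap.
    """
    n_eval, n_neutral = len(eval_framed), len(neutral_framed)
    p_eval, p_neutral = accuracy(eval_framed), accuracy(neutral_framed)
    gap = p_neutral - p_eval

    # Pooled proportion under the null hypothesis of no difference.
    pooled = (sum(eval_framed) + sum(neutral_framed)) / (n_eval + n_neutral)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_eval + 1 / n_neutral))
    z = gap / se if se > 0 else 0.0
    return gap, z


# Toy, made-up results: 100 questions per condition.
eval_results = [True] * 55 + [False] * 45
neutral_results = [True] * 80 + [False] * 20

gap, z = sandbagging_gap(eval_results, neutral_results)
print(f"accuracy gap: {gap:.2f}, z statistic: {z:.2f}")
```

A large positive gap would only be suggestive: framing can change performance for reasons other than deliberate underperformance, which is close to the concern the critiques above raise about contrived evaluation setups.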