Shallow Review of Technical AI Safety, 2025

Sandbagging evals

Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.

Theory of Change:If models can distinguish between evaluation and deployment contexts ("evaluation awareness"), they might learn to "sandbag" or deliberately underperform to hide dangerous capabilities, fooling safety evaluations. By developing evaluations for sandbagging, we can test whether our safety methods are being deceived and detect this behavior before a model is deployed.

General Approach:Behavioral

Target Case:Pessimistic

Orthodox Problems:

7.Superintelligence can fool human supervisors 8.Superintelligence can hack software supervisors

See Also:

AI deception evals, Situational awareness and self-awareness evals, Various Redteams

Estimated FTEs:10-50

Critiques:

The main external critique, from sources like "the void" and "Lessons from a Chimp", is that this research "overattribut[es] human traits" to models. It argues that what's being measured isn't genuine sandbagging but models "playing-along-with-drama behaviour" in response to "artificial and contrived" evals.

Outputs:

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models— Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Jacob Haimes, Felix Hofstätter, Teun van der Weij

Sandbagging in a Simple Survival Bandit Problem— Joel Dyer, Daniel Jarne Ornia, Nicholas Bishop, Anisoara Calinescu, Michael Wooldridge

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs— Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping

AI Sandbagging: Language Models can Strategically Underperform on Evaluations— Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward

Automated Researchers Can Subtly Sandbag— Johannes Gasteiger, Vladimir Mikulik, Ethan Perez, Fabien Roger, Misha Wagner, Akbir Khan, Sam Bowman, Jan Leike

LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring— Chloe Li, Mary Phuong, Noah Y. Siegel

White Box Control at UK AISI - Update on Sandbagging Investigations— Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking— Buck Shlegeris, Julian Stastny

Won't vs. Can't: Sandbagging-like Behavior from Claude Models