Sandbagging evals
Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.
Theory of Change:If models can distinguish between evaluation and deployment contexts ("evaluation awareness"), they might learn to "sandbag" or deliberately underperform to hide dangerous capabilities, fooling safety evaluations. By developing evaluations for sandbagging, we can test whether our safety methods are being deceived and detect this behavior before a model is deployed.
General Approach:Behavioral
Target Case:Pessimistic
Estimated FTEs:10-50
Critiques:
The main external critique, from sources like "the void" and "Lessons from a Chimp", is that this research "overattribut[es] human traits" to models. It argues that what's being measured isn't genuine sandbagging but models "playing-along-with-drama behaviour" in response to "artificial and contrived" evals.
Outputs:
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models— Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Jacob Haimes, Felix Hofstätter, Teun van der Weij
Sandbagging in a Simple Survival Bandit Problem— Joel Dyer, Daniel Jarne Ornia, Nicholas Bishop, Anisoara Calinescu, Michael Wooldridge
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs— Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping
AI Sandbagging: Language Models can Strategically Underperform on Evaluations— Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
Automated Researchers Can Subtly Sandbag— Johannes Gasteiger, Vladimir Mikulik, Ethan Perez, Fabien Roger, Misha Wagner, Akbir Khan, Sam Bowman, Jan Leike
LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring— Chloe Li, Mary Phuong, Noah Y. Siegel
White Box Control at UK AISI - Update on Sandbagging Investigations— Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking— Buck Shlegeris, Julian Stastny