AI scheming evals
Evaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and compliance to evade detection by human supervisors and safety mechanisms.
Theory of Change:Robust evaluations must move beyond checking final outputs and probe the model's reasoning to verify that alignment is genuine, not faked, because capable models are capable of strategically concealing misaligned goals (scheming) to pass standard behavioural evaluations.
Target Case:Pessimistic
Orthodox Problems:
Some names:Rohin Shah, Mikita Balesni, Wojciech Zaremba
Estimated FTEs:30-60
Critiques:
Outputs:
Stress Testing Deliberative Alignment for Anti-Scheming Training— Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Angela Fan, Andrei Matveiakin, Rusheb Shah, Marcus Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, Marius Hobbhahn
Frontier Models are Capable of In-context Scheming— Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, Marius Hobbhahn
Agentic Misalignment: How LLMs Could Be Insider Threats— Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, Kevin Troy