Shallow Review of Technical AI Safety, 2025

Control

If we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols that extract useful work from them while preventing sabotage and catching incriminating behaviour?
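To make the paradigm concrete, here is a minimal illustrative sketch of a "trusted monitoring" control protocol in the spirit of this agenda (not the implementation from any of the papers below; all function names and thresholds are hypothetical): a capable but untrusted model does the work, a weaker trusted monitor scores each output for suspicion, and outputs are accepted, resampled, or escalated to human audit depending on that score.

```python
# Hypothetical sketch of a trusted-monitoring control protocol.
# U = untrusted model (capable, possibly misaligned); T = trusted monitor
# (weaker, assumed honest). Thresholds and names are illustrative only.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    output: str
    action: str  # "accept", "resample"-exhausted escalation, or "audit"


def trusted_monitoring(
    untrusted: Callable[[str], str],       # U: produces the work
    monitor: Callable[[str, str], float],  # T: suspicion score in [0, 1]
    task: str,
    audit_threshold: float = 0.9,    # very suspicious: send to human audit
    resample_threshold: float = 0.5, # mildly suspicious: discard and retry
    max_resamples: int = 3,
) -> Verdict:
    """Run U on the task, escalating based on T's suspicion score."""
    output = untrusted(task)
    for _ in range(max_resamples):
        suspicion = monitor(task, output)
        if suspicion >= audit_threshold:
            return Verdict(output, "audit")   # possibly incriminating
        if suspicion < resample_threshold:
            return Verdict(output, "accept")  # looks benign: use the work
        output = untrusted(task)  # mid-suspicion: resample (cf. Ctrl-Z)
    return Verdict(output, "audit")  # persistent suspicion: escalate


# Toy usage with stub models:
verdict = trusted_monitoring(
    untrusted=lambda t: f"solution to {t}",
    monitor=lambda t, o: 0.1,  # stub monitor rates everything benign
    task="write a sorting function",
)
```

The safety-usefulness trade-off in the literature lives in those thresholds: auditing more buys safety at the cost of human labour, while resampling (as in Ctrl-Z below) tries to recover usefulness without giving a scheming model free passes.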
Target Case: Worst Case
See Also: safety cases
Some names: UK AISI, Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan, Alex Mallen
Estimated FTEs: 5-50
Outputs:
Ctrl-Z: Controlling AI Agents via Resampling (Aryan Bhatt, Buck Shlegeris, Adam Kaufman, Cody Rushing, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan)
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents (Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, Tyler Tracy, Buck Shlegeris, Joe Benton)
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats (Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan)
D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models (Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas)
Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? (Alex Mallen, Charlie Griffin, Misha Wagner, Alessandro Abate, Buck Shlegeris)
Evaluating Control Protocols for Untrusted AI Agents (Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, Joe Benton)
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability (Artur Zolkowski, Wen Xing, David Lindner, Florian Tramèr, Erik Jenner)
Optimizing AI Agent Attacks With Synthetic Data (Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov, Joe Benton)
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols (Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate)
A sketch of an AI control safety case (Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving)
Assessing confidence in frontier AI safety cases (Stephen Barrett, Philip Fox, Joshua Krook, Tuneer Mondal, Simon Mylius, Alejandro Tlaie)
ControlArena (Rogan Inglis, Ollie Matthews, Tyler Tracy, Oliver Makins, Tom Catling, Asa Cooper Stickland, Rasmus Faber-Espensen, Daniel O'Connell, Myles Heller, Miguel Brandao, Adam Hanson, Arathi Mani, Tomek Korbak, Jan Michelfeit, Dishank Bansal, Tomas Bark, Chris Canal, Charlie Griffin, Jasmine Wang, Alan Cooney)
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence (Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving)
The Alignment Project by UK AISI (Mojmir, Benjamin Hilton, Jacob Pfau, Geoffrey Irving, Joseph Bloom, Tomek Korbak, David Africa, Edmund Lau)
Towards evaluations-based safety cases for AI scheming (Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Charlotte Stix, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq)
Incentives for Responsiveness, Instrumental Control and Impact (Ryan Carey, Eric Langlois, Chris van Merwijk, Shane Legg, Tom Everitt)
Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework (Rishane Dassanayake, Mario Demetroudi, James Walpole, Lindley Lentati, Jason R. Brown, Edward James Young)
Dynamic safety cases for frontier AI (Carmen Cârlan, Francesca Gomez, Yohan Mathew, Ketana Krishna, René King, Peter Gebauer, Ben R. Smith)
Takeaways from sketching a control safety case (Josh Clymer, Buck Shlegeris)