Control
If we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols that extract useful work from them while preventing sabotage and catching incriminating behaviour?
Target Case: Worst Case
See Also: safety cases
Some names: UK AISI, Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan, Alex Mallen
Estimated FTEs: 5-50
Outputs:
Ctrl-Z: Controlling AI Agents via Resampling— Aryan Bhatt, Buck Shlegeris, Adam Kaufman, Cody Rushing, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents— Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, Tyler Tracy, Buck Shlegeris, Joe Benton
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats— Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan
D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models— Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas
Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?— Alex Mallen, Charlie Griffin, Misha Wagner, Alessandro Abate, Buck Shlegeris
Evaluating Control Protocols for Untrusted AI Agents— Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, Joe Benton
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability— Artur Zolkowski, Wen Xing, David Lindner, Florian Tramèr, Erik Jenner
Optimizing AI Agent Attacks With Synthetic Data— Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov, Joe Benton
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols— Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate
A sketch of an AI control safety case— Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving
Assessing confidence in frontier AI safety cases— Stephen Barrett, Philip Fox, Joshua Krook, Tuneer Mondal, Simon Mylius, Alejandro Tlaie
ControlArena— Rogan Inglis, Ollie Matthews, Tyler Tracy, Oliver Makins, Tom Catling, Asa Cooper Stickland, Rasmus Faber-Espensen, Daniel O'Connell, Myles Heller, Miguel Brandao, Adam Hanson, Arathi Mani, Tomek Korbak, Jan Michelfeit, Dishank Bansal, Tomas Bark, Chris Canal, Charlie Griffin, Jasmine Wang, Alan Cooney
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence— Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving
The Alignment Project by UK AISI— Mojmir, Benjamin Hilton, Jacob Pfau, Geoffrey Irving, Joseph Bloom, Tomek Korbak, David Africa, Edmund Lau
Towards evaluations-based safety cases for AI scheming— Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Charlotte Stix, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq
Incentives for Responsiveness, Instrumental Control and Impact— Ryan Carey, Eric Langlois, Chris van Merwijk, Shane Legg, Tom Everitt
AI companies are unlikely to make high-assurance safety cases if timelines are short— Ryan Greenblatt
Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework— Rishane Dassanayake, Mario Demetroudi, James Walpole, Lindley Lentati, Jason R. Brown, Edward James Young
Dynamic safety cases for frontier AI— Carmen Cârlan, Francesca Gomez, Yohan Mathew, Ketana Krishna, René King, Peter Gebauer, Ben R. Smith
Takeaways from sketching a control safety case— Josh Clymer, Buck Shlegeris
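To make the core idea concrete, here is a minimal sketch of a "trusted monitoring" control protocol, one of the simplest designs studied in this literature: an untrusted model does the work, a weaker trusted monitor scores how suspicious each output looks, and the most suspicious outputs are escalated to human audit or replaced by a trusted model's answer. The model wrappers, thresholds, and function names below are hypothetical illustrations, not any particular paper's implementation.

```python
# Minimal sketch of a trusted-monitoring control protocol.
# All models, thresholds, and helpers are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable


@dataclass
class ControlDecision:
    output: str
    route: str  # "untrusted", "trusted_fallback", or "human_audit"


def trusted_monitoring_protocol(
    task: str,
    untrusted_model: Callable[[str], str],   # capable but possibly misaligned
    trusted_model: Callable[[str], str],     # weaker but assumed non-scheming
    monitor: Callable[[str, str], float],    # suspicion score in [0, 1]
    audit_threshold: float = 0.95,           # most suspicious outputs go to humans
    defer_threshold: float = 0.80,           # moderately suspicious -> trusted model
) -> ControlDecision:
    """Run the untrusted model, score its output, and route accordingly."""
    candidate = untrusted_model(task)
    suspicion = monitor(task, candidate)

    if suspicion >= audit_threshold:
        # Escalate to a (scarce) human auditor; outputs caught here are the
        # main way the protocol gathers incriminating evidence of misalignment.
        return ControlDecision(output=candidate, route="human_audit")
    if suspicion >= defer_threshold:
        # Too risky to use directly, not worth an audit: fall back to the
        # trusted model at some cost in usefulness.
        return ControlDecision(output=trusted_model(task), route="trusted_fallback")
    # Low suspicion: accept the untrusted model's work.
    return ControlDecision(output=candidate, route="untrusted")
```

Control evaluations such as those in ControlArena and SHADE-Arena stress-test protocols of this shape by having a red team play the untrusted model and actively try to slip sabotage past the monitor.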