Shallow Review of Technical AI Safety, 2025

Google DeepMind

Structure: research laboratory subsidiary of a for-profit
Safety teams: amplified oversight; interpretability; ASAT eng (automated alignment research); Causal Incentives Working Group; Frontier Safety Risk Assessment (evals, threat models, the framework); Mitigations (e.g. banning accounts, refusal training, jailbreak robustness); Loss of Control (control, alignment evals).

See Also:
Some names: Rohin Shah, Anca Dragan, Alex Turner, Jonah Brown-Cohen, Neel Nanda, Erik Jenner
Outputs:
A Pragmatic Vision for Interpretability (Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, Bilal Chughtai, Callum McDougall, János Kramár, Lewis Smith)
How Can Interpretability Researchers Help AGI Go Well? (Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, Bilal Chughtai, Callum McDougall, János Kramár, Lewis Smith)
Evaluating Frontier Models for Stealth and Situational Awareness (Mary Phuong, Roland S. Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah)
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors (Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah)
MONA: Managed Myopia with Approval Feedback (Sebastian Farquhar, David Lindner, Rohin Shah)
Consistency Training Helps Stop Sycophancy and Jailbreaks (Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah)
An Approach to Technical AGI Safety and Security (Rohin Shah, Alex Irpan, Alexander Matt Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, Rishub Jain, Rory Greig, Samuel Albanie, Scott Emmons, Sebastian Farquhar, Sébastien Krier, Senthooran Rajamanoharan, Sophie Bridgers, Tobi Ijitoye, Tom Everitt, Victoria Krakovna, Vikrant Varma, Vladimir Mikulik, Zachary Kenton, Dave Orr, Shane Legg, Noah Goodman, Allan Dafoe, Four Flynn, Anca Dragan)
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) (Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah, Neel Nanda)
Steering Gemini Using BIDPO Vectors (Alex Turner, Mark Kurzeja, Dave Orr, David Elson)
Difficulties with Evaluating a Deception Detector for AIs (Lewis Smith, Bilal Chughtai, Neel Nanda)
Taking a responsible path to AGI (Anca Dragan, Rohin Shah, Four Flynn, Shane Legg)
Evaluating potential cybersecurity threats of advanced AI (Four Flynn, Mikel Rodriguez, Raluca Ada Popa)
A Pragmatic Way to Measure Chain-of-Thought Monitorability (Scott Emmons, Roland S. Zimmermann, David K. Elson, Rohin Shah)