Google DeepMind
Structure: research laboratory; subsidiary of a for-profit (Alphabet)
Safety teams:
Amplified Oversight, Interpretability, ASAT eng (automated alignment research), Causal Incentives Working Group, Frontier Safety Risk Assessment (evals, threat models, the Frontier Safety Framework), Mitigations (e.g. banning accounts, refusal training, jailbreak robustness), Loss of Control (control, alignment evals).
Public alignment agenda: An Approach to Technical AGI Safety and Security
Framework: Frontier Safety Framework
See Also:
White-box safety (i.e. Interpretability), Scalable Oversight
Some names: Rohin Shah, Anca Dragan, Alex Turner, Jonah Brown-Cohen, Neel Nanda, Erik Jenner
Critiques:
Stein-Perlman; Carlsmith on labs in general; underelicitation; On Google's Safety Plan
Outputs:
A Pragmatic Vision for Interpretability — Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, Bilal Chughtai, Callum McDougall, János Kramár, Lewis Smith
How Can Interpretability Researchers Help AGI Go Well? — Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, Bilal Chughtai, Callum McDougall, János Kramár, Lewis Smith
Evaluating Frontier Models for Stealth and Situational Awareness — Mary Phuong, Roland S. Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors — Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah
MONA: Managed Myopia with Approval Feedback — Sebastian Farquhar, David Lindner, Rohin Shah
Consistency Training Helps Stop Sycophancy and Jailbreaks — Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah
An Approach to Technical AGI Safety and Security — Rohin Shah, Alex Irpan, Alexander Matt Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, Rishub Jain, Rory Greig, Samuel Albanie, Scott Emmons, Sebastian Farquhar, Sébastien Krier, Senthooran Rajamanoharan, Sophie Bridgers, Tobi Ijitoye, Tom Everitt, Victoria Krakovna, Vikrant Varma, Vladimir Mikulik, Zachary Kenton, Dave Orr, Shane Legg, Noah Goodman, Allan Dafoe, Four Flynn, Anca Dragan
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) — Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah, Neel Nanda
Steering Gemini Using BIDPO Vectors — Alex Turner, Mark Kurzeja, Dave Orr, David Elson
Difficulties with Evaluating a Deception Detector for AIs — Lewis Smith, Bilal Chughtai, Neel Nanda
Taking a responsible path to AGI — Anca Dragan, Rohin Shah, Four Flynn, Shane Legg
Evaluating potential cybersecurity threats of advanced AI — Four Flynn, Mikel Rodriguez, Raluca Ada Popa
Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance — Senthooran Rajamanoharan, Neel Nanda
A Pragmatic Way to Measure Chain-of-Thought Monitorability — Scott Emmons, Roland S. Zimmermann, David K. Elson, Rohin Shah