Google DeepMind
Structure: research laboratory; subsidiary of a for-profit (Alphabet)
Safety teams:
Amplified Oversight, Interpretability, ASAT eng (automated alignment research), Causal Incentives Working Group, Frontier Safety Risk Assessment (evals, threat models, the Frontier Safety Framework), Mitigations (e.g. banning accounts, refusal training, jailbreak robustness), Loss of Control (control, alignment evals).
Public alignment agenda: An Approach to Technical AGI Safety and Security
Framework: Frontier Safety Framework
See Also:
White-box safety (i.e. Interpretability), Scalable Oversight
Some names: Rohin Shah, Anca Dragan, Alex Turner, Jonah Brown-Cohen, Neel Nanda, Erik Jenner
Critiques:
Stein-Perlman; Carlsmith on labs in general; underelicitation; On Google's Safety Plan
Outputs:
A Pragmatic Vision for Interpretability — Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, Bilal Chughtai, Callum McDougall, János Kramár, Lewis Smith
How Can Interpretability Researchers Help AGI Go Well? — Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, Bilal Chughtai, Callum McDougall, János Kramár, Lewis Smith
Evaluating Frontier Models for Stealth and Situational Awareness — Mary Phuong, Roland S. Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors — Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah
MONA: Managed Myopia with Approval Feedback — Sebastian Farquhar, David Lindner, Rohin Shah
Consistency Training Helps Stop Sycophancy and Jailbreaks — Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah
An Approach to Technical AGI Safety and Security — Rohin Shah, Alex Irpan, Alexander Matt Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, Rishub Jain, Rory Greig, Samuel Albanie, Scott Emmons, Sebastian Farquhar, Sébastien Krier, Senthooran Rajamanoharan, Sophie Bridgers, Tobi Ijitoye, Tom Everitt, Victoria Krakovna, Vikrant Varma, Vladimir Mikulik, Zachary Kenton, Dave Orr, Shane Legg, Noah Goodman, Allan Dafoe, Four Flynn, Anca Dragan
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) — Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah, Neel Nanda
Steering Gemini Using BIDPO Vectors — Alex Turner, Mark Kurzeja, Dave Orr, David Elson
Difficulties with Evaluating a Deception Detector for AIs — Lewis Smith, Bilal Chughtai, Neel Nanda
Taking a responsible path to AGI — Anca Dragan, Rohin Shah, Four Flynn, Shane Legg
Evaluating potential cybersecurity threats of advanced AI — Four Flynn, Mikel Rodriguez, Raluca Ada Popa
Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance — Senthooran Rajamanoharan, Neel Nanda
A Pragmatic Way to Measure Chain-of-Thought Monitorability — Scott Emmons, Roland S. Zimmermann, David K. Elson, Rohin Shah