Shallow Review of Technical AI Safety, 2025

OpenAI

Structure: public benefit corp
Safety teams:

Alignment; Safety Systems (Interpretability, Safety Oversight, Pretraining Safety, Robustness, Safety Research, Trustworthy AI, with a new Misalignment Research team coming); Preparedness; Model Policy; Safety and Security Committee; Safety Advisory Group. The Persona Features paper had a distinct author list. There is no named successor to Superalignment.

Public alignment agenda: none. Boaz Barak offers personal views.
Some names: Johannes Heidecke, Boaz Barak, Mia Glaese, Jenny Nitishinskaya, Lama Ahmad, Naomi Bashkansky, Miles Wang, Wojciech Zaremba, David Robinson, Zico Kolter, Jerry Tworek, Eric Wallace, Olivia Watkins, Kai Chen, Chris Koch, Andrea Vallone, Leo Gao
Critiques:
Stein-Perlman; Stewart; underelicitation; Midas; defense; Carlsmith on labs in general. It's difficult to model OpenAI as a single agent: "ALTMAN: I very rarely get to have anybody work on anything. One thing about researchers is they're going to work on what they're going to work on, and that's that."
Outputs:
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation (Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, David Farhi)
Persona Features Control Emergent Misalignment (Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing)
Stress Testing Deliberative Alignment for Anti-Scheming Training (Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Angela Fan, Andrei Matveiakin, Rusheb Shah, Marcus Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, Marius Hobbhahn)
Deliberative Alignment: Reasoning Enables Safer Language Models (Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, Amelia Glaese)
Toward understanding and preventing misalignment generalization (Miles Wang, Tom Dupré la Tour, Olivia Watkins, Aleksandar Makelov, Ryan A. Chi, Samuel Miserendino, Tejal Patwardhan, Dan Mossing)
Our updated Preparedness Framework (OpenAI Preparedness Team)
Trading Inference-Time Compute for Adversarial Robustness (Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese)