Shallow Review of Technical AI Safety, 2025

Anthropic

Structure:public-benefit corp
Safety teams:

Scalable Alignment (Leike), Alignment Evals (Bowman), Interpretability (Olah), Control (Perez), Model Psychiatry (Lindsey), Character (Askell), Alignment Stress-Testing (Hubinger), Alignment Mitigations (Price?), Frontier Red Team (Graham), Safeguards (?), Societal Impacts (Ganguli), Trust and Safety (Sanderford), Model Welfare (Fish)

Public alignment agenda:directions, bumpers, checklist, an old vague view
Framework:RSP
See Also:
Some names:Chris Olah, Evan Hubinger, Sam Marks, Johannes Treutlein, Sam Bowman, Euan Ong, Fabien Roger, Adam Jermyn, Holden Karnofsky, Ethan Perez, Jack Lindsey, Amanda Askell
Outputs:
Evaluating honesty and lie detection techniques on a diverse suite of dishonest modelsRowan Wang, Johannes Treutlein, Fabien Roger, Evan Hubinger, Sam Marks
Agentic Misalignment: How LLMs could be insider threatsAengus Lynch, Benjamin Wright, Caleb Larson, Kevin K. Troy, Stuart J. Ritchie, Sören Mindermann, Ethan Perez, Evan Hubinger
Why Do Some Language Models Fake Alignment While Others Don't?abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
Forecasting Rare Language Model BehaviorsErik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma
Findings from a Pilot Anthropic—OpenAI Alignment Evaluation ExerciseSamuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, Ethan Perez, Nicholas Carlini
On the Biology of a Large Language ModelJack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson
Poisoning Attacks on LLMs Require a Near-constant Number of Poison SamplesAlexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, Robert Kirk
Circuit Tracing: Revealing Computational Graphs in Language ModelsEmmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson
SHADE-Arena: Evaluating sabotage and monitoring in LLM agentsXiang Deng, Chen Bo Calvin Zhang, Tyler Tracy, Buck Shlegeris, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Henry Sleight
Open-sourcing circuit tracing toolsMichael Hanna, Mateusz Piotrowski, Emmanuel Ameisen, Jack Lindsey, Johnny Lin, Curt Tigges