Anthropic
Structure: public-benefit corporation
Safety teams:
Scalable Alignment (Leike), Alignment Evals (Bowman), Interpretability (Olah), Control (Perez), Model Psychiatry (Lindsey), Character (Askell), Alignment Stress-Testing (Hubinger), Alignment Mitigations (Price?), Frontier Red Team (Graham), Safeguards (?), Societal Impacts (Ganguli), Trust and Safety (Sanderford), Model Welfare (Fish)
Public alignment agenda: directions, bumpers, checklist, an old vague view
Framework: RSP (Responsible Scaling Policy)
See Also:
White-box safety (i.e. Interpretability), Scalable Oversight
Some names: Chris Olah, Evan Hubinger, Sam Marks, Johannes Treutlein, Sam Bowman, Euan Ong, Fabien Roger, Adam Jermyn, Holden Karnofsky, Ethan Perez, Jack Lindsey, Amanda Askell
Outputs:
Evaluating honesty and lie detection techniques on a diverse suite of dishonest models— Rowan Wang, Johannes Treutlein, Fabien Roger, Evan Hubinger, Sam Marks
Agentic Misalignment: How LLMs could be insider threats— Aengus Lynch, Benjamin Wright, Caleb Larson, Kevin K. Troy, Stuart J. Ritchie, Sören Mindermann, Ethan Perez, Evan Hubinger
Why Do Some Language Models Fake Alignment While Others Don't?— abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
Forecasting Rare Language Model Behaviors— Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma
Findings from a Pilot Anthropic—OpenAI Alignment Evaluation Exercise— Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, Ethan Perez, Nicholas Carlini
On the Biology of a Large Language Model— Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson
Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples— Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, Robert Kirk
Circuit Tracing: Revealing Computational Graphs in Language Models— Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson
SHADE-Arena: Evaluating sabotage and monitoring in LLM agents— Xiang Deng, Chen Bo Calvin Zhang, Tyler Tracy, Buck Shlegeris, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Henry Sleight
Recommendations for Technical AI Safety Research Directions— Anthropic Alignment Science Team
Constitutional Classifiers: Defending against universal jailbreaks— Anthropic Safeguards Research Team
Claude 4.5 Opus Soul Document— Richard-Weiss
Open-sourcing circuit tracing tools— Michael Hanna, Mateusz Piotrowski, Emmanuel Ameisen, Jack Lindsey, Johnny Lin, Curt Tigges