Shallow Review of Technical AI Safety, 2025

Other evals

A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.
Theory of Change:By developing novel benchmarks for specific, hard-to-measure properties (like honesty), critiquing the reliability of existing methods (like cultural surveys), and improving the formal rigor of evaluation systems (like LLM-as-Judges), researchers can create a more robust and comprehensive suite of evaluations to catch failures missed by standard capability or safety testing.
General Approach:Behavioral
Target Case:Average Case
See Also:
other more specific sections on evals
Some names:Richard Ren, Mantas Mazeika, Stephen Casper
Estimated FTEs:20-50
Outputs:
Shutdown Resistance in Large Language ModelsJeremy Schlatter, Benjamin Weinstein-Raun, Jeffrey Ladish
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent SafetySanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, Maarten Sap
Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providersJared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, Nick Haber
Lessons from a Chimp: AI "Scheming" and the Quest for Ape LanguageChristopher Summerfield, Lennart Luettgau, Magda Dubois, Hannah Rose Kirk, Kobi Hackenburg, Catherine Fist, Katarina Slama, Nicola Ding, Rebecca Anselmetti, Andrew Strait, Mario Giulianelli, Cozmin Ududec
Establishing Best Practices for Building Rigorous Agentic BenchmarksYuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang
Sycophantic AI Decreases Prosocial Intentions and Promotes DependenceMyra Cheng, Cinoo Lee, Pranav Khadpe, Sunny Yu, Dyllan Han, Dan Jurafsky
AI Testing Should Account for Sophisticated Strategic BehaviourVojtech Kovarik, Eric Olav Chen, Sami Petersen, Alexis Ghersengorin, Vincent Conitzer
Spiral-BenchSam Paech
Discerning What Matters: A Multi-Dimensional Assessment of Moral Competence in LLMsDaniel Kilov, Caroline Hendy, Secil Yanik Guyot, Aaron J. Snoswell, Seth Lazar
CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence ReasoningLauren Deason, Adam Bali, Ciprian Bejean, Diana Bolocan, James Crnkovich, Ioana Croitoru, Krishna Durai, Chase Midler, Calin Miron, David Molnar, Brad Moon, Bruno Ostarcevic, Alberto Peltea, Matt Rosenberg, Catalin Sandu, Arthur Saputkin, Sagar Shah, Daniel Stan, Ernest Szocs, Shengye Wan, Spencer Whitman, Sven Krasser, Joshua Saxe