Other evals
A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.
Theory of Change:By developing novel benchmarks for specific, hard-to-measure properties (like honesty), critiquing the reliability of existing methods (like cultural surveys), and improving the formal rigor of evaluation systems (like LLM-as-Judges), researchers can create a more robust and comprehensive suite of evaluations to catch failures missed by standard capability or safety testing.
General Approach:Behavioral
Target Case:Average Case
See Also:
other more specific sections on evals
Some names:Richard Ren, Mantas Mazeika, Stephen Casper
Estimated FTEs:20-50
Outputs:
Shutdown Resistance in Large Language Models— Jeremy Schlatter, Benjamin Weinstein-Raun, Jeffrey Ladish
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety— Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, Maarten Sap
Do LLMs Comply Differently During Tests? Is This a Hidden Variable in Safety Evaluation? And Can We Steer That?— Sahar Abdelnabi, Ahmed Salem
Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)— Roland Pihlakas, Sruthi Susan Kuriakose, Shruti Datta Gupta
Syco-bench: A Benchmark for LLM Sycophancy— Tim Duffy
Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers— Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, Nick Haber
Lessons from a Chimp: AI "Scheming" and the Quest for Ape Language— Christopher Summerfield, Lennart Luettgau, Magda Dubois, Hannah Rose Kirk, Kobi Hackenburg, Catherine Fist, Katarina Slama, Nicola Ding, Rebecca Anselmetti, Andrew Strait, Mario Giulianelli, Cozmin Ududec
Establishing Best Practices for Building Rigorous Agentic Benchmarks— Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang
Logical Consistency Between Disagreeing Experts and Its Role in AI Safety— Andrés Corrada-Emmanuel
Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence— Myra Cheng, Cinoo Lee, Pranav Khadpe, Sunny Yu, Dyllan Han, Dan Jurafsky
AI Testing Should Account for Sophisticated Strategic Behaviour— Vojtech Kovarik, Eric Olav Chen, Sami Petersen, Alexis Ghersengorin, Vincent Conitzer
Spiral-Bench— Sam Paech
Discerning What Matters: A Multi-Dimensional Assessment of Moral Competence in LLMs— Daniel Kilov, Caroline Hendy, Secil Yanik Guyot, Aaron J. Snoswell, Seth Lazar
CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning— Lauren Deason, Adam Bali, Ciprian Bejean, Diana Bolocan, James Crnkovich, Ioana Croitoru, Krishna Durai, Chase Midler, Calin Miron, David Molnar, Brad Moon, Bruno Ostarcevic, Alberto Peltea, Matt Rosenberg, Catalin Sandu, Arthur Saputkin, Sagar Shah, Daniel Stan, Ernest Szocs, Shengye Wan, Spencer Whitman, Sven Krasser, Joshua Saxe