Autonomy evals
Measure an AI's ability to act autonomously to complete long-horizon, complex tasks.
Theory of Change: Measuring the length and complexity of the tasks an AI can complete (its "time horizon") lets us track capability growth and identify when models gain dangerous autonomous capabilities (such as R&D acceleration or replication).
General Approach: Behavioral
Target Case: Average Case
See Also:
Estimated FTEs: 10-50
Outputs:
Measuring AI Ability to Complete Long Tasks— Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts— Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Holden Karnofsky, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, Elizabeth Barnes
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents— Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, Maksym Andriushchenko
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety— Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, Maarten Sap
PaperBench: Evaluating AI's Ability to Replicate AI Research— Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Chan Jun Shern, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Mia Glaese, Tejal Patwardhan
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents— Axel Backlund, Lukas Petersson
Forecasting Frontier Language Model Agent Capabilities— Govind Pimpale, Axel Højmark, Jérémy Scheurer, Marius Hobbhahn
GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments— Hanlin Zhu, Tianyu Guo, Song Mei, Stuart Russell, Nikhil Ghosh, Alberto Bietti, Jiantao Jiao
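
Illustration: the "time horizon" named in the Theory of Change can be made concrete with a small calculation. The sketch below assumes one plausible operationalization: fit a logistic curve of agent success against the log of how long each task takes a human, then read off the task length at which predicted success is 50%. The task lengths, outcomes, and function names are all hypothetical, and this is not presented as the exact pipeline of any paper listed above.

```python
import math

def fit_logistic(xs, ys, steps=20000, lr=0.1):
    """Fit p(success) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon_minutes(task_minutes, successes, p=0.5):
    """Task length (in minutes) at which the fitted success probability equals p."""
    xs = [math.log2(t) for t in task_minutes]
    a, b = fit_logistic(xs, successes)
    if b >= 0:
        raise ValueError("success does not decrease with task length; horizon undefined")
    # Solve sigmoid(a + b * x) = p for x, then undo the log2 transform.
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** ((logit_p - a) / b)

# Hypothetical data: human completion time per task, and whether the agent succeeded.
task_minutes = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
successes = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]

horizon = time_horizon_minutes(task_minutes, successes)
print(f"Estimated 50% time horizon: {horizon:.0f} minutes")
```

Repeating this fit for successive model releases and plotting the resulting horizons against release date is one way to track the capability growth the entry describes. A different threshold (for example `p=0.8`) gives a more conservative horizon.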