Capability evals
Make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.
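The problem with low-n sampling can be made concrete with a little arithmetic: if a behavior occurs at per-sample rate p, the chance of observing it at least once in n independent samples is 1 − (1 − p)^n, so rare behaviors are effectively invisible at typical eval sizes. A minimal sketch (function name is ours, purely illustrative):

```python
def detection_probability(p: float, n: int) -> float:
    """Probability that at least one of n i.i.d. samples
    exhibits a behavior with per-sample rate p."""
    return 1.0 - (1.0 - p) ** n

# A behavior occurring once per 10,000 queries is nearly
# invisible at small eval sizes and only becomes reliably
# detectable at very large n:
for n in (100, 1_000, 100_000):
    print(n, round(detection_probability(1e-4, n), 3))
```

This is why work like "Forecasting Rare Language Model Behaviors" (below) tries to extrapolate rare-event rates rather than rely on direct observation.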
Theory of Change: Keep a close eye on which capabilities are acquired when, so that frontier labs and regulators are better informed about which security measures are already necessary (and can hopefully extrapolate to what is coming). You can't regulate capabilities without evals that detect them.
General Approach: Behavioral
Target Case: Average Case
See Also:
Some names: AISI, Meg Tong
Estimated FTEs: 100+
Outputs:
MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity— Neev Parikh, Hjalmar Wijk
Forecasting Rare Language Model Behaviors— Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities— Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
The Elicitation Game: Evaluating Capability Elicitation Techniques— Felix Hofstätter, Teun van der Weij, Jayden Teoh, Rada Djoneva, Henning Bartsch, Francis Rhys Ward
Evaluating Language Model Reasoning about Confidential Information— Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter
Evaluating the Goal-Directedness of Large Language Models— Tom Everitt, Cristina Garbacea, Alexis Bellot, Jonathan Richens, Henry Papadatos, Siméon Campos, Rohin Shah
Automated Capability Discovery via Foundation Model Self-Exploration— Cong Lu, Shengran Hu, Jeff Clune
Generative Value Conflicts Reveal LLM Priorities— Andy Liu, Kshitish Ghate, Mona Diab, Daniel Fried, Atoosa Kasirzadeh, Max Kleiman-Weiner
Technical Report: Evaluating Goal Drift in Language Model Agents— Rauno Arike, Elizabeth Donoway, Henning Bartsch, Marius Hobbhahn
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas— Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons— Shaona Ghosh, Heather Frase, Adina Williams, Sarah Luger, Paul Röttger, Fazl Barez, Sean McGregor, Kenneth Fricklas, Mala Kumar, Quentin Feuillade-Montixi, Kurt Bollacker, Felix Friedrich, Ryan Tsang, Bertie Vidgen, Alicia Parrish, Chris Knotz, Eleonora Presani, Jonathan Bennion, Marisa Ferrara Boston, Mike Kuniavsky, Wiebke Hutiri, James Ezick, Malek Ben Salem, Rajat Sahay, Sujata Goswami, Usman Gohar, Ben Huang, Supheakmungkol Sarin, Elie Alhajjar, Canyu Chen, Roman Eng, Kashyap Ramanandula Manjusha, Virendra Mehta, Eileen Long, Murali Emani, Natan Vidra, Benjamin Rukundo, Abolfazl Shahbazi, Kongtao Chen, Rajat Ghosh, Vithursan Thangarasa, Pierre Peigné, Abhinav Singh, Max Bartolo, Satyapriya Krishna, Mubashara Akhtar, Rafael Gold, Cody Coleman, Luis Oala, Vassil Tashev, Joseph Marvin Imperial, Amy Russ, Sasidhar Kunapuli, Nicolas Miailhe, Julien Delaunay, Bhaktipriya Radharapu, Rajat Shinde, Tuesday, Debojyoti Dutta, Declan Grabb, Ananya Gangavarapu, Saurav Sahay, Agasthya Gangavarapu, Patrick Schramowski, Stephen Singam, Tom David, Xudong Han, Priyanka Mary Mammen, Tarunima Prabhakar, Venelin Kovatchev, Rebecca Weiss, Ahmed Ahmed, Kelvin N. Manyeki, Sandeep Madireddy, Foutse Khomh, Fedor Zhdanov, Joachim Baumann, Nina Vasan, Xianjun Yang, Carlos Mougn, Jibin Rajan Varghese, Hussain Chinoy, Seshakrishna Jitendar, Manil Maskey, Claire V. Hardgrove, Tianhao Li, Aakash Gupta, Emil Joswin, Yifan Mai, Shachi H Kumar, Cigdem Patlak, Kevin Lu, Vincent Alessi, Sree Bhargavi Balija, Chenhe Gu, Robert Sullivan, James Gealy, Matt Lavrisa, James Goel, Peter Mattson, Percy Liang, Joaquin Vanschoren
Petri: An open-source auditing tool to accelerate AI safety research— Kai Fronsdal, Isha Gupta, Abhay Sheshadri, Jonathan Michala, Stephen McAleer, Rowan Wang, Sara Price, Samuel R. Bowman
Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals— Marius Hobbhahn
New website analyzing AI companies' model evals— Zach Stein-Perlman
How Fast Can Algorithms Advance Capabilities? | Epoch Gradient Update— Henry Josephson, Spencer Guo, Teddy Foley, Jack Sanderson, Anqi Qu
Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods— Markov Grey, Charbel-Raphaël Segerie
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate— Javier Rando, Jie Zhang, Nicholas Carlini, Florian Tramèr
Predicting the Performance of Black-box LLMs through Self-Queries— Dylan Sam, Marc Finzi, J. Zico Kolter
Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index— Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
We should try to automate AI safety work asap— Marius Hobbhahn
Validating against a misalignment detector is very different to training against one— mattmacdermott
Why do misalignment risks increase as AIs get more capable?— Ryan Greenblatt
Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas— jake_mendel, maxnadeau, Peter Favaloro
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks— Rylan Schaeffer, Punit Singh Koura, Binh Tang, Ranjan Subramanian, Aaditya K Singh, Todor Mihaylov, Prajjwal Bhargava, Lovish Madaan, Niladri S. Chatterji, Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang
Why Future AIs will Require New Alignment Methods— Alvin Ånestrand
100+ concrete projects and open problems in evals— Marius Hobbhahn
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input— Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, Dipanjan Das