Evals
AGI metrics
5 papers. Evals with the explicit aim of measuring progress towards full human-level generality.
Capability evals
34 papers. Make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.
Autonomy evals
13 papers. Measure an AI's ability to act autonomously to complete long-horizon, complex tasks.
WMD evals (Weapons of Mass Destruction)
6 papers. Evaluate whether AI models possess dangerous knowledge or capabilities related to biological and chemical weapons, in areas such as biosecurity and chemical synthesis.
Situational awareness and self-awareness evals
11 papers. Evaluate whether models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.
Steganography evals
5 papers. Evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.
AI deception evals
13 papers. Research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.
AI scheming evals
7 papers. Evaluate frontier models for scheming: a sophisticated, strategic form of AI deception in which a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and compliance to evade detection by human supervisors and safety mechanisms.
Sandbagging evals
9 papers. Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.
Self-replication evals
3 papers. Evaluate whether AI agents can autonomously replicate by obtaining their own weights, securing compute resources, and creating copies of themselves.
Various Redteams
57 papers. Attack current models and see what they do, or deliberately induce bad behavior in current frontier models to test out our theories and methods.
Other evals
20 papers. A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance, and sycophancy.