Shallow Review of Technical AI Safety, 2025

Evals

AGI metrics

5 papers

Evals with the explicit aim of measuring progress towards full human-level generality.

Capability evals

34 papers

Make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.

Autonomy evals

13 papers

Measure an AI's ability to act autonomously to complete long-horizon, complex tasks.

WMD evals (Weapons of Mass Destruction)

6 papers

Evaluate whether AI models possess dangerous knowledge or capabilities related to biological and chemical weapons, such as biosecurity or chemical synthesis.

Situational awareness and self-awareness evals

11 papers

Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.

Steganography evals

5 papers

evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.

AI deception evals

13 papers

research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.

AI scheming evals

7 papers

Evaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and compliance to evade detection by human supervisors and safety mechanisms.

Sandbagging evals

9 papers

Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.

Self-replication evals

3 papers

evaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves.

Various Redteams

57 papers

attack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods.

Other evals

20 papers

A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.