Shallow Review of Technical AI Safety, 2025

Name	Section▲	Summary	Papers	FTEs	Target Case	Approaches	Problems	Funded By	Names
Anthropic	Labs	—	21	—	—	—	—	Amazon, Google, ICONIQ, Fidelity, Lightspeed, Altimeter, Bai...	chris-olah, evan-hubinger, sam-marks, johannes-treutlein, sam-bowman, euan-ong, fabien-roger, adam-j...
China	Labs	—	0	—	—	—	—	—	—
Google Deepmind	Labs	—	14	—	—	—	—	Google. Explicit 2024 Deepmind spending as a whole was [£1.3...	rohin-shah, allan-dafoe, anca-dragan, alex-irpan, alex-turner, anna-wang, arthur-conmy, david-lindne...
Meta	Labs	—	6	—	—	—	—	Meta	shuchao-bi, hongyuan-zhan, jingyu-zhang, haozhu-wang, eric-michael-smith, sid-wang, amr-sharaf, mahe...
OpenAI	Labs	—	12	—	—	—	—	Microsoft, [AWS](https://www.aboutamazon.com/news/aws/aws-op...	johannes-heidecke, boaz-barak, mia-glaese, jenny-nitishinskaya, lama-ahmad, naomi-bashkansky, miles-...
xAI	Labs	—	0	—	—	—	—	A16Z, Blackrock, Fidelity, Kingdom, Lightspeed, MGX, Morgan...	dan-hendrycks-advisor, juntang-zhuang, toby-pohlen, lianmin-zheng, piaoyang-cui, nikita-popov, ying-...
Assistance games, assistive agents	Black-box safety	Formalize how AI assistants learn about human preferences given uncertainty and partial observability, and construct environments which better incentivize AIs to learn what we want them to learn.	5	—	—	—	1 10	Future of Life Institute, Coefficient Giving, Survival and F...	joar-skalse, anca-dragan, caspar-oesterheld, david-krueger, dylan-hafield-menell, stuart-russell
Black-box make-AI-solve-it	Black-box safety	Focus on using existing models to improve and align further models.	12	—	Average Case	Engineering	—	most of the industry	jacques-thibodeau, matthew-shingle, nora-belrose, lewis-hammond, geoffrey-irving
Capability removal: unlearning	Black-box safety	Developing methods to selectively remove specific information, capabilities, or behaviors from a trained model (e.g. without retraining it from scratch). A mixture of black-box and white-box approache...	18	10-50	Pessimistic	—	8 12 10	Coefficient Giving, MacArthur Foundation, UK AI Safety Insti...	rowan-wang, avery-griffin, johannes-treutlein, zico-kolter, bruce-w-lee, addie-foote, alex-infanger,...
Chain of thought monitoring	Black-box safety	Supervise an AI's natural-language (output) "reasoning" to detect misalignment, scheming, or deception, rather than studying the actual internal states.	17	10-100	Average Case	Engineering	7 8 12	OpenAI, Anthropic, Google DeepMind	aether, bowen-baker, joost-huizinga, leo-gao, scott-emmons, erik-jenner, yanda-chen, james-chua, owa...
Character training and persona steering	Black-box safety	Map, shape, and control the personae of language models, such that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behavio...	13	—	Average Case	Cognitive	1	Anthropic, Coefficient Giving	truthful-ai, openai, anthropic, clr, amanda-askell, jack-lindsey, janus, theia-vogel, sharan-maiya,...
Control	Black-box safety	If we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols to extract useful work from them while preventing sabotage, and watch...	22	5-50	Worst Case	—	—	—	redwood, uk-aisi, deepmind, openai, anthropic, buck-shlegeris, ryan-greenblatt, kshitij-sachan, alex...
Data filtering	Black-box safety	Builds safety into models from the start by removing harmful or toxic content (like dual-use info) from the pretraining data, rather than relying only on post-training alignment.	4	10-50	Average Case	Engineering	4 1	Anthropic, various academics	yanda-chen, pratyush-maini, kyle-obrien, stephen-casper, simon-pepin-lehalleur, jesse-hoogland, hima...
Data poisoning defense	Black-box safety	Develops methods to detect and prevent malicious or backdoor-inducing samples from being included in the training data.	3	5-20	Pessimistic	Engineering	8 11	Google DeepMind, Anthropic, University of Cambridge, Vector...	alexandra-souly, javier-rando, ed-chapman, hanna-foerster, ilia-shumailov, yiren-zhao
Data quality for alignment	Black-box safety	Improves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data.	5	20-50	Average Case	Engineering	7 1	Anthropic, Google DeepMind, OpenAI, Meta AI, various academi...	maarten-buyl, kelsey-kraus, margaret-kroll, danqing-shi
Emergent misalignment	Black-box safety	Fine-tuning LLMs on one narrow antisocial task can cause general misalignment including deception, shutdown resistance, harmful advice, and extremist sympathies, when those behaviors are never trained...	17	10-50	Pessimistic	Behavioral	4 7	Coefficient Giving, >$1 million	truthful-ai, jan-betley, james-chua, mia-taylor, miles-wang, edward-turner, anna-soligo, alex-cloud,...
Harm reduction for open weights	Black-box safety	Develops methods, primarily based on pretraining data intervention, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to remove safety features or...	5	10-100	Average Case	Engineering	11	UK AI Safety Institute (AISI), EleutherAI, Coefficient Givin...	kyle-obrien, stephen-casper, quentin-anthony, tomek-korbak, rishub-tamirisa, mantas-mazeika, stella-...
Hyperstition studies	Black-box safety	Study, steer, and intervene on the following feedback loop: "we produce stories about how present and future AI systems behave" → "these stories become training data for the AI" → "these stories shape...	4	1-10	Average Case	Cognitive	1	Unclear, niche	alex-turner, hyperstition-aihttpswwwhyperstitionaicom, kyle-obrien
Inference-time: In-context learning	Black-box safety	Investigate what runtime guidelines, rules, or examples provided to an LLM yield better behavior.	5	—	Average Case	Engineering	—	—	jacob-steinhardt, kayo-yin, atticus-geiger
Inference-time: Steering	Black-box safety	Manipulate an LLM's internal representations/token probabilities without touching weights.	4	—	Average Case	Engineering	—	—	taylor-sorensen, constanza-fierro, kshitish-ghate, arthur-vogels
Inoculation prompting	Black-box safety	Prompt mild misbehaviour in training, to prevent the failure mode where once AI misbehaves in a mild way, it will be more inclined towards all bad behaviour.	4	—	Average Case	Engineering	—	most of the industry	ariana-azarbal, daniel-tan, victor-gillioz, alex-turner, alex-cloud, monte-macdiarmid, daniel-ziegle...
Iterative alignment at post-train-time	Black-box safety	Modify weights after pre-training.	16	—	Average Case	Engineering	—	most of the industry	adam-gleave, anca-dragan, jacob-steinhardt, rohin-shah
Iterative alignment at pretrain-time	Black-box safety	Guide weights during pretraining.	2	—	Average Case	Engineering	—	most of the industry	jan-leike, stuart-armstrong, cyrus-cousins, oliver-daniels
Mild optimisation	Black-box safety	Avoid Goodharting by getting AI to satisfice rather than maximise.	4	10-50	—	Cognitive	1	Google DeepMind	—
Model psychopathology	Black-box safety	Find interesting LLM phenomena like glitch [tokens](https://vgel.me/posts/seahorse/) and the reversal curse; these are vital data for theory.	9	5-20	Pessimistic	—	4	Coefficient Giving (via Truthful AI and Interpretability gra...	janus, truthful-ai, theia-vogel, stewart-slocum, nell-watson, samuel-g-b-johnson, liwei-jiang, monik...
Model specs and constitutions	Black-box safety	Write detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.	11	—	Average Case	Engineering	1	major funders include Anthropic and OpenAI (internally)	amanda-askell, joe-carlsmith
Model values / model preferences	Black-box safety	Analyse and control emergent, coherent value systems in LLMs, which change as models scale, and can contain problematic values like preferences for AIs over humans.	14	30	Pessimistic	Cognitive	1	Coefficient Giving. $289,000 SFF funding for CAIS.	mantas-mazeika, xuwang-yin, rishub-tamirisa, jaehyuk-lim, bruce-w-lee, richard-ren, long-phan, norma...
RL safety	Black-box safety	Improves the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.	11	20-70	Pessimistic	Engineering	4 1 7	Google DeepMind, University of Oxford, CMU, Coefficient Givi...	joar-skalse, karim-abdel-sadek, matthew-farrugia-roberts, benjamin-plaut, fang-wu, stephen-zhao, ale...
Safeguards (inference-time auxiliaries)	Black-box safety	Layers of inference-time defenses, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviors.	6	100+	Average Case	Engineering	7 12	most of the big labs	mrinank-sharma, meg-tong, jesse-mu, alwin-peng, julian-michael, henry-sleight, theodore-sumers, raj-...
Synthetic data for alignment	Black-box safety	Uses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models.	8	50-150	Average Case	Engineering	4 7 1	Anthropic, Google DeepMind, OpenAI, Meta AI, various academi...	mianqiu-huang, xiaoran-liu, rylan-schaeffer, nevan-wichers, aram-ebtekar, jiaxin-wen, vishakh-padmak...
The "Neglected Approaches" Approach	Black-box safety	Agenda-agnostic approaches to identifying good but overlooked empirical alignment ideas, working with theorists who could use engineers, and prototyping them.	3	15	Average Case	Engineering	11	AE Studio	ae-studio, gunnar-zarncke, cameron-berg, michael-vaiana, judd-rosenblatt, diogo-schwerz-de-lucena
Activation engineering	White-box safety	Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.	15	20-100	Average Case	—	1	Coefficient Giving, Anthropic	runjin-chen, andy-arditi, david-krueger, jan-wehner, narmeen-oozeer, reza-bayat, adam-karvonen, jiud...
Causal Abstractions	White-box safety	Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.	3	10-30	Worst Case	Cognitive	4	Various academic groups, Google DeepMind, Goodfire	atticus-geiger, christopher-potts, thomas-icard, theodora-mara-pslar, sara-magliacane, jiuding-sun,...
Data attribution	White-box safety	Quantifies the influence of individual training data points on a model's specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back t...	12	30-60	Average Case	Behavioral	4 1	Various academic groups	roger-grosse, philipp-alexander-kreer, jin-hwa-lee, matthew-smith, abhilasha-ravichander, andrew-wan...
Extracting latent knowledge	White-box safety	Identify and decoding the "true" beliefs or knowledge represented inside a model's activations, even when the model's output is deceptive or false.	9	20-40	Worst Case	Cognitive	7	Open Philanthropy, Anthropic, NSF, various academic grants	bartosz-cywiski, emil-ryd, senthooran-rajamanoharan, alexander-pan, lijie-chen, jacob-steinhardt, ja...
Human inductive biases	White-box safety	Discover connections deep learning AI systems have with human brains and human learning processes. Develop an 'alignment moonshot' based on a coherent theory of learning which applies to both humans a...	6	4	Pessimistic	Cognitive	4	Google DeepMind, various academic groups	lukas-muttenthaler, quentin-delfosse
Learning dynamics and developmental interpretability	White-box safety	Builds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model's training and in-context...	14	10-50	Worst Case	Cognitive	4	Manifund, Survival and Flourishing Fund, EA Funds	timaeus, jesse-hoogland, george-wang, daniel-murfet, stan-van-wingerden, alexander-gietelink-oldenzi...
Lie and deception detectors	White-box safety	Detect when a model is being deceptive or lying by building white- or black-box detectors. Some work below requires intent in their definition, while other work focuses only on whether the model state...	11	10-50	Pessimistic	Cognitive	—	Anthropic, Deepmind, UK AISI, Coefficient Giving	cadenza, sam-marks, rowan-wang, kieron-kretschmar, sharan-maiya, walter-laurito, chris-cundy, adam-g...
Model diffing	White-box safety	Understand what happens when a model is finetuned, what the "diff" between the finetuned and the original model consists in.	9	10-30	Pessimistic	Cognitive	1	various academic groups, Anthropic, Google DeepMind	julian-minder, clment-dumas, neel-nanda, trenton-bricken, jack-lindsey
Monitoring concepts	White-box safety	Identifies directions or subspaces in a model's latent state that correspond to high-level concepts (like refusal, deception, or planning) and uses them to audit models for misalignment, monitor them...	11	50-100	Pessimistic	Cognitive	1 4 12	Coefficient Giving, Anthropic, various academic groups	daniel-beaglehole, adityanarayanan-radhakrishnan, enric-boix-adser, tom-wollschlger, anna-soligo, ja...
Other interpretability	White-box safety	Interpretability that does not fall well into other categories.	19	30-60	—	—	7 4	—	lee-sharkey, dario-amodei, david-chalmers, been-kim, neel-nanda, david-d-baek, lauren-greenspan, dmi...
Pragmatic interpretability	White-box safety	Directly tackling concrete, safety-critical problems on the path to AGI by using lightweight interpretability tools (like steering and probing) and empirical feedback from proxy tasks, rather than pur...	3	30-60	—	Cognitive	7 4	Google DeepMind, Anthropic, various academic groups	lee-sharkey, dario-amodei, david-chalmers, been-kim, neel-nanda, david-d-baek, lauren-greenspan, dmi...
Representation structure and geometry	White-box safety	What do the representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we get the semantics from this geometry?	13	10-50	—	Cognitive	4 7	Various academic groups, Astera Institute, Coefficient Givin...	simplex, insight-interaction-lab, paul-riechers, adam-shai, martin-wattenberg, blake-richards, mateu...
Reverse engineering	White-box safety	Decompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model's...	33	100-200	Worst Case	Cognitive	4 7	—	lucius-bushnaq, dan-braun, lee-sharkey, aaron-mueller, atticus-geiger, sheridan-feucht, david-bau, y...
Sparse Coding	White-box safety	Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic "features" which correspond to interpretable concepts.	44	50-100	Average Case	—	1 4 7	everyone, roughly. Frontier labs, LTFF, Coefficient Giving,...	leo-gao, dan-mossing, emmanuel-ameisen, jack-lindsey, adam-pearce, thomas-heap, abhinav-menon, kenny...
Brainlike-AGI Safety	Safety by construction	Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let's figure out what those circuits are and how they work; this will involve symbol grounding. "a yet-to-b...	6	1-5	Worst Case	Cognitive	—	Astera Institute	steve-byrnes
Guaranteed-Safe AI	Safety by construction	Have an AI system generate outputs (e.g. code, control systems, or RL policies) which it can quantitatively guarantee comply with a formal safety specification and world model.	5	10-100	Worst Case	—	1 4 7 9 12	Manifund, ARIA, Coefficient Giving, Survival and Flourishing...	aria, lawzero, atlas-computing, flf, max-tegmark, beneficial-ai-foundation, steve-omohundro, david-d...
Scientist AI	Safety by construction	Develop powerful, nonagentic, uncertain world models that accelerate scientific progress while avoiding the risks of agent AIs	2	1-10	Pessimistic	Cognitive	3 4 5	ARIA, Gates Foundation, Future of Life Institute, Coefficien...	yoshua-bengio, younesse-kaddar
AI explanations of AIs	Make AI solve it	Make open AI tools to explain AIs, including AI agents. e.g. automatic feature descriptions for neuron activation patterns; an interface for steering these features; a behaviour elicitation agent that...	5	15-30	Pessimistic	Cognitive	7 8	Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Z...	transluce, jacob-steinhardt, neil-chowdhury, vincent-huang, sarah-schwettmann, robert-friel
Debate	Make AI solve it	In the limit, it's easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.	6	—	Worst Case	—	1 7	Google, others	rohin-shah, jonah-brown-cohen, georgios-piliouras, uk-aisi-benjamin-holton
LLM introspection training	Make AI solve it	Train LLMs to the predict the outputs of high-quality whitebox methods, to induce general self-explanation skills that use its own 'introspective' access	2	2-20	—	Cognitive	4 7 8	Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Z...	belinda-z-li, zifan-carl-guo, vincent-huang, jacob-steinhardt, jacob-andreas, jack-lindsey
Supervising AIs improving AIs	Make AI solve it	Build formal and empirical frameworks where AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools which enable scalable tracking of behavioural drift, bench...	8	1-10	Pessimistic	Behavioral	7 8	Long-Term Future Fund, lab funders	roman-engeler, akbir-khan, ethan-perez
Weak-to-strong generalization	Make AI solve it	Use weaker models to supervise and provide a feedback signal to stronger models.	4	2-20	Average Case	Engineering	8	lab funders, Eleuther funders	joshua-engels, nora-belrose, david-d-baek
Agent foundations	Theory	Develop philosophical clarity and mathematical formalizations of building blocks that might be useful for plans to align strong superintelligence, such as agency, optimization strength, decision theor...	10	—	Worst Case	Cognitive	1 2 4	—	abram-demski, alex-altair, sam-eisenstat, thane-ruthenis, alfred-harwood, daniel-c, dalcy-k, jos-ped...
Asymptotic guarantees	Theory	Prove that if a safety process has enough resources (human data quality, training time, neural network capacity), then in the limit some system specification will be guaranteed. Use complexity theory,...	4	5 - 10	Pessimistic	Cognitive	4 7	AISI	aisi, jacob-pfau, benjamin-hilton, geoffrey-irving, simon-marshall, will-kirby, martin-soto, david-a...
Behavior alignment theory	Theory	Predict properties of future AGI (e.g. power-seeking) with formal models; formally state and prove hypotheses about the properties powerful systems will have and how we might try to change them.	10	1-10	Worst Case	—	2 5	—	ram-potham, michael-k-cohen, max-harmsraelifin, john-wentworth, david-lorell, elliott-thornley
Heuristic explanations	Theory	Formalize mechanistic explanations of neural network behavior, automate the discovery of these "heuristic explanations" and use them to predict when novel input will lead to extreme behavior (i.e. "Lo...	5	1-10	Worst Case	—	4 8	—	jacob-hilton, mark-xu, eric-neyman, victor-lecomte, george-robinson
High-Actuation Spaces	Theory	Mech interp and alignment assume a stable "computational substrate" (linear algebra on GPUs). If later AI uses different substrates (e.g. something neuromorphic), methods like probes and steering will...	7	1-10	Pessimistic	—	—	—	sahil-k, matt-farr, aditya-arpitha-prasad, chris-pang, aditya-adiga, jayson-amati, steve-petersen, t...
Natural abstractions	Theory	Develop a theory of concepts that explains how they are learned, how they structure a particular system's understanding, and how mutual translatability can be achieved between different collections of...	10	1-10	Worst Case	Cognitive	5 7 9	—	john-wentworth, paul-colognese, david-lorrell, sam-eisenstat, fernando-rosas
Other corrigibility	Theory	Diagnose and communicate obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behaviors	9	1-10	Pessimistic	—	2 5	—	jeremy-gillen
The Learning-Theoretic Agenda	Theory	Create a mathematical theory of intelligent agents that encompasses both humans and the AIs we want, one that specifies what it means for two such agents to be aligned; translate between its ontology...	6	3	Worst Case	Cognitive	1 4 9	Survival and Flourishing Fund, ARIA, UK AISI, Coefficient Gi...	vanessa-kosoy, diffractor, gergely-szcs
Tiling agents	Theory	An aligned agentic system modifying itself into an unaligned system would be bad and we can research ways that this could occur and infrastructure/approaches that prevent it from happening.	4	1-10	Worst Case	Cognitive	1 2 4	—	abram-demski
Aligned to who?	Multi-agent first	Technical protocols for taking seriously the plurality of human values, cultures, and communities when aligning AI to "humanity"	9	5 - 15	Average Case	Behavioral	1 13	Future of Life Institute, Survival and Flourishing Fund, Dee...	joel-z-leibo, divya-siddarth, sb-krier, luke-thorburn, seth-lazar, ai-objectives-institute, the-coll...
Aligning to context	Multi-agent first	Align AI directly to the role of participant, collaborator, or advisor for our best real human practices and institutions, instead of aligning AI to separately representable goals, rules, or utility f...	8	5	—	Behavioral	1 2 4 5 13	ARIA, OpenAI, Survival and Flourishing Fund	full-stack-alignment, meaning-alignment-institute, plurality-institute, tan-zhi-xuan, matija-frankli...
Aligning to the social contract	Multi-agent first	Generate AIs' operational values from 'social contract'-style ideal civic deliberation formalisms and their consequent rulesets for civic actors	8	5 - 10	—	Cognitive	1 4 5 10 13	Deepmind, Macroscopic Ventures	gillian-hadfield, tan-zhi-xuan, sydney-levine, matija-franklin, joshua-b-tenenbaum
Aligning what?	Multi-agent first	Develop alternatives to agent-level models of alignment, by treating human-AI interactions, AI-assisted institutions, AI economic or cultural systems, drives within one AI, and other causal/constituti...	13	5-10	—	—	1 2 4 5 13	Future of Life Institute, Emmett Shear	richard-ngo, emmett-shear, softmax, full-stack-alignment, ai-objectives-institute, sahil, tj, andrew...
Theory for aligning multiple AIs	Multi-agent first	Use realistic game-theory variants (e.g. evolutionary game theory, computational game theory) or develop alternative game theories to describe/predict the collective and individual behaviours of AI ag...	12	10	—	Cognitive	4 7 8	SFF, CAIF, Deepmind, Macroscopic Ventures	lewis-hammond, emery-cooper, allan-chan, caspar-oesterheld, vincent-conitzer, vojta-kovarik, nathani...
Tools for aligning multiple AIs	Multi-agent first	Develop tools and techniques for designing and testing multi-agent AI scenarios, for auditing real-world multi-agent AI dynamics, and for aligning AIs in multi-AI settings.	12	10 - 15	—	—	4 7 8	Coefficient Giving, Deepmind, Cooperative AI Foundation	andrew-critch, lewis-hammond, emery-cooper, allan-chan, caspar-oesterheld, vincent-conitzer, gillian...
AGI metrics	Evals	Evals with the explicit aim of measuring progress towards full human-level generality.	5	10-50	—	Behavioral	—	Leverhulme Trust, Open Philanthropy, Long-Term Future Fund	cais, cfi-kinds-of-intelligence, apart-research, openai, metr, lexin-zhou, adam-scholl, lorenzo-pacc...
AI deception evals	Evals	research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.	13	30-80	Worst Case	—	7 8	Labs, academic institutions (e.g., Harvard, CMU, Barcelona I...	cadenza, fred-heiding, simon-lermen, andrew-kao, myra-cheng, cinoo-lee, pranav-khadpe, satyapriya-kr...
AI scheming evals	Evals	Evaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and complian...	7	30-60	Pessimistic	—	7	OpenAI, Anthropic, Google DeepMind, Open Philanthropy	bronson-schoen, alexander-meinke, jason-wolfe, mary-phuong, rohin-shah, evgenia-nitishinskaya, mikit...
Autonomy evals	Evals	Measure an AI's ability to act autonomously to complete long-horizon, complex tasks.	13	10-50	Average Case	Behavioral	—	The Audacious Project, Open Philanthropy	metr, thomas-kwa, ben-west, joel-becker, beth-barnes, hjalmar-wijk, tao-lin, giulio-starace, oliver-...
Capability evals	Evals	Make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.	34	100+	Average Case	Behavioral	—	basically everyone. Google, Microsoft, Open Philanthropy, LT...	metr, aisi, apollo-research, marrius-hobbhahn, meg-tong, mary-phuong, beth-barnes, thomas-kwa, joel-...
Other evals	Evals	A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.	20	20-50	Average Case	Behavioral	—	Lab funders (OpenAI), Open Philanthropy (which funds CAIS, t...	richard-ren, mantas-mazeika, andrs-corrada-emmanuel, ariba-khan, stephen-casper
Sandbagging evals	Evals	Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.	9	10-50	Pessimistic	Behavioral	7 8	Anthropic (and its funders, e.g., Google, Amazon), UK Govern...	teun-van-der-weij, cameron-tice, chloe-li, johannes-gasteiger, joseph-bloom, joel-dyer
Self-replication evals	Evals	evaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves.	3	10-20	Worst Case	Behavioral	5 12	UK Government (via UK AI Safety Institute)	sid-black, asa-cooper-stickland, jake-pencharz, oliver-sourbut, michael-schmatz, jay-bailey, ollie-m...
Situational awareness and self-awareness evals	Evals	Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.	11	30-70	Worst Case	Behavioral	7 8	frontier labs (Google DeepMind, Anthropic), Open Philanthrop...	jan-betley, xuchan-bao, martn-soto, mary-phuong, roland-s-zimmermann, joe-needham, giles-edkins, gov...
Steganography evals	Evals	evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.	5	1-10	Worst Case	Behavioral	12 7	Anthropic (and its general funders, e.g., Google, Amazon)	antonio-norelli, michael-bronstein
Various Redteams	Evals	attack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods.	57	100+	Average Case	Behavioral	12 4	Frontier labs (Anthropic, OpenAI, Google), government (UK AI...	ryan-greenblatt, benjamin-wright, aengus-lynch, john-hughes, samuel-r-bowman, andy-zou, nicholas-car...
WMD evals (Weapons of Mass Destruction)	Evals	Evaluate whether AI models possess dangerous knowledge or capabilities related to biological and chemical weapons, such as biosecurity or chemical synthesis.	6	10-50	Pessimistic	Behavioral	—	Open Philanthropy, UK AI Safety Institute (AISI), frontier l...	lennart-justen, haochen-zhao, xiangru-tang, ziran-yang, aidan-peppin, anka-reuel, stephen-casper