| Name | Section | Summary | Papers | FTEs | Target Case | Approaches | Problems | Funded By | Names |
|---|---|---|---|---|---|---|---|---|---|
| Anthropic | Labs | — | 21 | — | — | — | — | Amazon, Google, ICONIQ, Fidelity, Lightspeed, Altimeter, Bai... | chris-olah, evan-hubinger, sam-marks, johannes-treutlein, sam-bowman, euan-ong, fabien-roger, adam-j... |
| China | Labs | — | 0 | — | — | — | — | — | — |
| Google Deepmind | Labs | — | 14 | — | — | — | — | Google. Explicit 2024 Deepmind spending as a whole was [£1.3... | rohin-shah, allan-dafoe, anca-dragan, alex-irpan, alex-turner, anna-wang, arthur-conmy, david-lindne... |
| Meta | Labs | — | 6 | — | — | — | — | Meta | shuchao-bi, hongyuan-zhan, jingyu-zhang, haozhu-wang, eric-michael-smith, sid-wang, amr-sharaf, mahe... |
| OpenAI | Labs | — | 12 | — | — | — | — | Microsoft, [AWS](https://www.aboutamazon.com/news/aws/aws-op... | johannes-heidecke, boaz-barak, mia-glaese, jenny-nitishinskaya, lama-ahmad, naomi-bashkansky, miles-... |
| xAI | Labs | — | 0 | — | — | — | — | A16Z, Blackrock, Fidelity, Kingdom, Lightspeed, MGX, Morgan... | dan-hendrycks-advisor, juntang-zhuang, toby-pohlen, lianmin-zheng, piaoyang-cui, nikita-popov, ying-... |
| Assistance games, assistive agents | Black-box safety | Formalize how AI assistants learn about human preferences given uncertainty and partial observability, and construct environments which better incentivize AIs to learn what we want them to learn. | 5 | — | — | — | — | Future of Life Institute, Coefficient Giving, Survival and F... | joar-skalse, anca-dragan, caspar-oesterheld, david-krueger, dylan-hadfield-menell, stuart-russell |
| Black-box make-AI-solve-it | Black-box safety | Focus on using existing models to improve and align further models. | 12 | — | Average Case | — | — | most of the industry | jacques-thibodeau, matthew-shingle, nora-belrose, lewis-hammond, geoffrey-irving |
| Capability removal: unlearning | Black-box safety | Developing methods to selectively remove specific information, capabilities, or behaviors from a trained model (e.g. without retraining it from scratch). A mixture of black-box and white-box approache... | 18 | 10-50 | Pessimistic | — | — | Coefficient Giving, MacArthur Foundation, UK AI Safety Insti... | rowan-wang, avery-griffin, johannes-treutlein, zico-kolter, bruce-w-lee, addie-foote, alex-infanger,... |
| Chain of thought monitoring | Black-box safety | Supervise an AI's natural-language (output) "reasoning" to detect misalignment, scheming, or deception, rather than studying the actual internal states. | 17 | 10-100 | Average Case | — | — | OpenAI, Anthropic, Google DeepMind | aether, bowen-baker, joost-huizinga, leo-gao, scott-emmons, erik-jenner, yanda-chen, james-chua, owa... |
| Character training and persona steering | Black-box safety | Map, shape, and control the personae of language models, such that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behavio... | 13 | — | Average Case | — | — | Anthropic, Coefficient Giving | truthful-ai, openai, anthropic, clr, amanda-askell, jack-lindsey, janus, theia-vogel, sharan-maiya,... |
| Control | Black-box safety | If we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols to extract useful work from them while preventing sabotage, and watch... | 22 | 5-50 | Worst Case | — | — | — | redwood, uk-aisi, deepmind, openai, anthropic, buck-shlegeris, ryan-greenblatt, kshitij-sachan, alex... |
| Data filtering | Black-box safety | Builds safety into models from the start by removing harmful or toxic content (like dual-use info) from the pretraining data, rather than relying only on post-training alignment. | 4 | 10-50 | Average Case | — | — | Anthropic, various academics | yanda-chen, pratyush-maini, kyle-obrien, stephen-casper, simon-pepin-lehalleur, jesse-hoogland, hima... |
| Data poisoning defense | Black-box safety | Develops methods to detect and prevent malicious or backdoor-inducing samples from being included in the training data. | 3 | 5-20 | Pessimistic | — | — | Google DeepMind, Anthropic, University of Cambridge, Vector... | alexandra-souly, javier-rando, ed-chapman, hanna-foerster, ilia-shumailov, yiren-zhao |
| Data quality for alignment | Black-box safety | Improves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data. | 5 | 20-50 | Average Case | — | — | Anthropic, Google DeepMind, OpenAI, Meta AI, various academi... | maarten-buyl, kelsey-kraus, margaret-kroll, danqing-shi |
| Emergent misalignment | Black-box safety | Fine-tuning LLMs on one narrow antisocial task can cause general misalignment including deception, shutdown resistance, harmful advice, and extremist sympathies, when those behaviors are never trained... | 17 | 10-50 | Pessimistic | — | — | Coefficient Giving, >$1 million | truthful-ai, jan-betley, james-chua, mia-taylor, miles-wang, edward-turner, anna-soligo, alex-cloud,... |
| Harm reduction for open weights | Black-box safety | Develops methods, primarily based on pretraining data intervention, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to remove safety features or... | 5 | 10-100 | Average Case | — | — | UK AI Safety Institute (AISI), EleutherAI, Coefficient Givin... | kyle-obrien, stephen-casper, quentin-anthony, tomek-korbak, rishub-tamirisa, mantas-mazeika, stella-... |
| Hyperstition studies | Black-box safety | Study, steer, and intervene on the following feedback loop: "we produce stories about how present and future AI systems behave" → "these stories become training data for the AI" → "these stories shape... | 4 | 1-10 | Average Case | — | — | Unclear, niche | alex-turner, hyperstition-ai, kyle-obrien |
| Inference-time: In-context learning | Black-box safety | Investigate what runtime guidelines, rules, or examples provided to an LLM yield better behavior. | 5 | — | Average Case | — | — | — | jacob-steinhardt, kayo-yin, atticus-geiger |
| Inference-time: Steering | Black-box safety | Manipulate an LLM's internal representations/token probabilities without touching weights (see the steering sketch after this table). | 4 | — | Average Case | — | — | — | taylor-sorensen, constanza-fierro, kshitish-ghate, arthur-vogels |
| Inoculation prompting | Black-box safety | Prompt mild misbehaviour during training, to prevent the failure mode where, once an AI misbehaves in a mild way, it becomes more inclined towards all bad behaviour. | 4 | — | Average Case | — | — | most of the industry | ariana-azarbal, daniel-tan, victor-gillioz, alex-turner, alex-cloud, monte-macdiarmid, daniel-ziegle... |
| Iterative alignment at post-train-time | Black-box safety | Modify weights after pre-training. | 16 | — | Average Case | — | — | most of the industry | adam-gleave, anca-dragan, jacob-steinhardt, rohin-shah |
| Iterative alignment at pretrain-time | Black-box safety | Guide weights during pretraining. | 2 | — | Average Case | — | — | most of the industry | jan-leike, stuart-armstrong, cyrus-cousins, oliver-daniels |
| Mild optimisation | Black-box safety | Avoid Goodharting by getting AI to satisfice rather than maximise. | 4 | 10-50 | — | — | — | Google DeepMind | — |
| Model psychopathology | Black-box safety | Find interesting LLM phenomena like glitch [tokens](https://vgel.me/posts/seahorse/) and the reversal curse; these are vital data for theory. | 9 | 5-20 | Pessimistic | — | — | Coefficient Giving (via Truthful AI and Interpretability gra... | janus, truthful-ai, theia-vogel, stewart-slocum, nell-watson, samuel-g-b-johnson, liwei-jiang, monik... |
| Model specs and constitutions | Black-box safety | Write detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment. | 11 | — | Average Case | — | — | major funders include Anthropic and OpenAI (internally) | amanda-askell, joe-carlsmith |
| Model values / model preferences | Black-box safety | Analyse and control emergent, coherent value systems in LLMs, which change as models scale, and can contain problematic values like preferences for AIs over humans. | 14 | 30 | Pessimistic | — | — | Coefficient Giving. $289,000 SFF funding for CAIS. | mantas-mazeika, xuwang-yin, rishub-tamirisa, jaehyuk-lim, bruce-w-lee, richard-ren, long-phan, norma... |
| RL safety | Black-box safety | Improves the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming. | 11 | 20-70 | Pessimistic | — | — | Google DeepMind, University of Oxford, CMU, Coefficient Givi... | joar-skalse, karim-abdel-sadek, matthew-farrugia-roberts, benjamin-plaut, fang-wu, stephen-zhao, ale... |
| Safeguards (inference-time auxiliaries) | Black-box safety | Layers of inference-time defenses, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviors. | 6 | 100+ | Average Case | — | — | most of the big labs | mrinank-sharma, meg-tong, jesse-mu, alwin-peng, julian-michael, henry-sleight, theodore-sumers, raj-... |
| Synthetic data for alignment | Black-box safety | Uses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models. | 8 | 50-150 | Average Case | — | — | Anthropic, Google DeepMind, OpenAI, Meta AI, various academi... | mianqiu-huang, xiaoran-liu, rylan-schaeffer, nevan-wichers, aram-ebtekar, jiaxin-wen, vishakh-padmak... |
| The "Neglected Approaches" Approach | Black-box safety | Agenda-agnostic approaches to identifying good but overlooked empirical alignment ideas, working with theorists who could use engineers, and prototyping them. | 3 | 15 | Average Case | — | — | AE Studio | ae-studio, gunnar-zarncke, cameron-berg, michael-vaiana, judd-rosenblatt, diogo-schwerz-de-lucena |
| Activation engineering | White-box safety | Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning. | 15 | 20-100 | Average Case | — | — | Coefficient Giving, Anthropic | runjin-chen, andy-arditi, david-krueger, jan-wehner, narmeen-oozeer, reza-bayat, adam-karvonen, jiud... |
| Causal Abstractions | White-box safety | Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations. | 3 | 10-30 | Worst Case | — | — | Various academic groups, Google DeepMind, Goodfire | atticus-geiger, christopher-potts, thomas-icard, theodora-mara-pslar, sara-magliacane, jiuding-sun,... |
| Data attribution | White-box safety | Quantifies the influence of individual training data points on a model's specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back t... | 12 | 30-60 | Average Case | — | — | Various academic groups | roger-grosse, philipp-alexander-kreer, jin-hwa-lee, matthew-smith, abhilasha-ravichander, andrew-wan... |
| Extracting latent knowledge | White-box safety | Identify and decode the "true" beliefs or knowledge represented inside a model's activations, even when the model's output is deceptive or false. | 9 | 20-40 | Worst Case | — | — | Open Philanthropy, Anthropic, NSF, various academic grants | bartosz-cywinski, emil-ryd, senthooran-rajamanoharan, alexander-pan, lijie-chen, jacob-steinhardt, ja... |
| Human inductive biases | White-box safety | Discover connections deep learning AI systems have with human brains and human learning processes. Develop an 'alignment moonshot' based on a coherent theory of learning which applies to both humans a... | 6 | 4 | Pessimistic | — | — | Google DeepMind, various academic groups | lukas-muttenthaler, quentin-delfosse |
| Learning dynamics and developmental interpretability | White-box safety | Builds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model's training and in-context... | 14 | 10-50 | Worst Case | — | — | Manifund, Survival and Flourishing Fund, EA Funds | timaeus, jesse-hoogland, george-wang, daniel-murfet, stan-van-wingerden, alexander-gietelink-oldenzi... |
| Lie and deception detectors | White-box safety | Detect when a model is being deceptive or lying by building white- or black-box detectors (see the probe sketch after this table). Some work below requires intent in its definition, while other work focuses only on whether the model state... | 11 | 10-50 | Pessimistic | — | — | Anthropic, Deepmind, UK AISI, Coefficient Giving | cadenza, sam-marks, rowan-wang, kieron-kretschmar, sharan-maiya, walter-laurito, chris-cundy, adam-g... |
| Model diffing | White-box safety | Understand what happens when a model is finetuned, i.e. what the "diff" between the finetuned and the original model consists of. | 9 | 10-30 | Pessimistic | — | — | various academic groups, Anthropic, Google DeepMind | julian-minder, clement-dumas, neel-nanda, trenton-bricken, jack-lindsey |
| Monitoring concepts | White-box safety | Identifies directions or subspaces in a model's latent state that correspond to high-level concepts (like refusal, deception, or planning) and uses them to audit models for misalignment, monitor them... | 11 | 50-100 | Pessimistic | — | — | Coefficient Giving, Anthropic, various academic groups | daniel-beaglehole, adityanarayanan-radhakrishnan, enric-boix-adsera, tom-wollschlager, anna-soligo, ja... |
| Other interpretability | White-box safety | Interpretability that does not fall well into other categories. | 19 | 30-60 | — | — | — | — | lee-sharkey, dario-amodei, david-chalmers, been-kim, neel-nanda, david-d-baek, lauren-greenspan, dmi... |
| Pragmatic interpretability | White-box safety | Directly tackling concrete, safety-critical problems on the path to AGI by using lightweight interpretability tools (like steering and probing) and empirical feedback from proxy tasks, rather than pur... | 3 | 30-60 | — | — | — | Google DeepMind, Anthropic, various academic groups | lee-sharkey, dario-amodei, david-chalmers, been-kim, neel-nanda, david-d-baek, lauren-greenspan, dmi... |
| Representation structure and geometry | White-box safety | What do the representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we get the semantics from this geometry? | 13 | 10-50 | — | — | — | Various academic groups, Astera Institute, Coefficient Givin... | simplex, insight-interaction-lab, paul-riechers, adam-shai, martin-wattenberg, blake-richards, mateu... |
| Reverse engineering | White-box safety | Decompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model's... | 33 | 100-200 | Worst Case | — | — | — | lucius-bushnaq, dan-braun, lee-sharkey, aaron-mueller, atticus-geiger, sheridan-feucht, david-bau, y... |
| Sparse Coding | White-box safety | Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic "features" which correspond to interpretable concepts (see the autoencoder sketch after this table). | 44 | 50-100 | Average Case | — | — | everyone, roughly. Frontier labs, LTFF, Coefficient Giving,... | leo-gao, dan-mossing, emmanuel-ameisen, jack-lindsey, adam-pearce, thomas-heap, abhinav-menon, kenny... |
| Brainlike-AGI Safety | Safety by construction | Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let's figure out what those circuits are and how they work; this will involve symbol grounding. "a yet-to-b... | 6 | 1-5 | Worst Case | — | — | Astera Institute | steve-byrnes |
| Guaranteed-Safe AI | Safety by construction | Have an AI system generate outputs (e.g. code, control systems, or RL policies) which it can quantitatively guarantee comply with a formal safety specification and world model. | 5 | 10-100 | Worst Case | — | — | Manifund, ARIA, Coefficient Giving, Survival and Flourishing... | aria, lawzero, atlas-computing, flf, max-tegmark, beneficial-ai-foundation, steve-omohundro, david-d... |
| Scientist AI | Safety by construction | Develop powerful, nonagentic, uncertain world models that accelerate scientific progress while avoiding the risks of agentic AIs. | 2 | 1-10 | Pessimistic | — | — | ARIA, Gates Foundation, Future of Life Institute, Coefficien... | yoshua-bengio, younesse-kaddar |
| AI explanations of AIs | Make AI solve it | Make open AI tools to explain AIs, including AI agents, e.g. automatic feature descriptions for neuron activation patterns; an interface for steering these features; a behaviour elicitation agent that... | 5 | 15-30 | Pessimistic | — | — | Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Z... | transluce, jacob-steinhardt, neil-chowdhury, vincent-huang, sarah-schwettmann, robert-friel |
| Debate | Make AI solve it | In the limit, it's easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters. | 6 | — | Worst Case | — | — | Google, others | rohin-shah, jonah-brown-cohen, georgios-piliouras, uk-aisi-benjamin-holton |
| LLM introspection training | Make AI solve it | Train LLMs to predict the outputs of high-quality whitebox methods, to induce general self-explanation skills that use their own 'introspective' access. | 2 | 2-20 | — | — | — | Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Z... | belinda-z-li, zifan-carl-guo, vincent-huang, jacob-steinhardt, jacob-andreas, jack-lindsey |
| Supervising AIs improving AIs | Make AI solve it | Build formal and empirical frameworks where AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools which enable scalable tracking of behavioural drift, bench... | 8 | 1-10 | Pessimistic | — | — | Long-Term Future Fund, lab funders | roman-engeler, akbir-khan, ethan-perez |
| Weak-to-strong generalization | Make AI solve it | Use weaker models to supervise and provide a feedback signal to stronger models (see the toy sketch after this table). | 4 | 2-20 | Average Case | — | — | lab funders, Eleuther funders | joshua-engels, nora-belrose, david-d-baek |
| Agent foundations | Theory | Develop philosophical clarity and mathematical formalizations of building blocks that might be useful for plans to align strong superintelligence, such as agency, optimization strength, decision theor... | 10 | — | Worst Case | — | — | — | abram-demski, alex-altair, sam-eisenstat, thane-ruthenis, alfred-harwood, daniel-c, dalcy-k, jos-ped... |
| Asymptotic guarantees | Theory | Prove that if a safety process has enough resources (human data quality, training time, neural network capacity), then in the limit some system specification will be guaranteed. Use complexity theory,... | 4 | 5-10 | Pessimistic | — | — | AISI | aisi, jacob-pfau, benjamin-hilton, geoffrey-irving, simon-marshall, will-kirby, martin-soto, david-a... |
| Behavior alignment theory | Theory | Predict properties of future AGI (e.g. power-seeking) with formal models; formally state and prove hypotheses about the properties powerful systems will have and how we might try to change them. | 10 | 1-10 | Worst Case | — | — | — | ram-potham, michael-k-cohen, max-harmsraelifin, john-wentworth, david-lorell, elliott-thornley |
| Heuristic explanations | Theory | Formalize mechanistic explanations of neural network behavior, automate the discovery of these "heuristic explanations" and use them to predict when novel input will lead to extreme behavior (i.e. "Lo... | 5 | 1-10 | Worst Case | — | — | — | jacob-hilton, mark-xu, eric-neyman, victor-lecomte, george-robinson |
| High-Actuation Spaces | Theory | Mech interp and alignment assume a stable "computational substrate" (linear algebra on GPUs). If later AI uses different substrates (e.g. something neuromorphic), methods like probes and steering will... | 7 | 1-10 | Pessimistic | — | — | — | sahil-k, matt-farr, aditya-arpitha-prasad, chris-pang, aditya-adiga, jayson-amati, steve-petersen, t... |
| Natural abstractions | Theory | Develop a theory of concepts that explains how they are learned, how they structure a particular system's understanding, and how mutual translatability can be achieved between different collections of... | 10 | 1-10 | Worst Case | — | — | — | john-wentworth, paul-colognese, david-lorrell, sam-eisenstat, fernando-rosas |
| Other corrigibility | Theory | Diagnose and communicate obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behaviors | 9 | 1-10 | Pessimistic | — | — | — | jeremy-gillen |
| The Learning-Theoretic Agenda | Theory | Create a mathematical theory of intelligent agents that encompasses both humans and the AIs we want, one that specifies what it means for two such agents to be aligned; translate between its ontology... | 6 | 3 | Worst Case | — | — | Survival and Flourishing Fund, ARIA, UK AISI, Coefficient Gi... | vanessa-kosoy, diffractor, gergely-szucs |
| Tiling agents | Theory | An aligned agentic system modifying itself into an unaligned system would be bad and we can research ways that this could occur and infrastructure/approaches that prevent it from happening. | 4 | 1-10 | Worst Case | — | — | — | abram-demski |
| Aligned to who? | Multi-agent first | Technical protocols for taking seriously the plurality of human values, cultures, and communities when aligning AI to "humanity" | 9 | 5-15 | Average Case | — | — | Future of Life Institute, Survival and Flourishing Fund, Dee... | joel-z-leibo, divya-siddarth, sb-krier, luke-thorburn, seth-lazar, ai-objectives-institute, the-coll... |
| Aligning to context | Multi-agent first | Align AI directly to the role of participant, collaborator, or advisor for our best real human practices and institutions, instead of aligning AI to separately representable goals, rules, or utility f... | 8 | 5 | — | — | — | ARIA, OpenAI, Survival and Flourishing Fund | full-stack-alignment, meaning-alignment-institute, plurality-institute, tan-zhi-xuan, matija-frankli... |
| Aligning to the social contract | Multi-agent first | Generate AIs' operational values from 'social contract'-style ideal civic deliberation formalisms and their consequent rulesets for civic actors | 8 | 5-10 | — | — | — | Deepmind, Macroscopic Ventures | gillian-hadfield, tan-zhi-xuan, sydney-levine, matija-franklin, joshua-b-tenenbaum |
| Aligning what? | Multi-agent first | Develop alternatives to agent-level models of alignment, by treating human-AI interactions, AI-assisted institutions, AI economic or cultural systems, drives within one AI, and other causal/constituti... | 13 | 5-10 | — | — | — | Future of Life Institute, Emmett Shear | richard-ngo, emmett-shear, softmax, full-stack-alignment, ai-objectives-institute, sahil, tj, andrew... |
| Theory for aligning multiple AIs | Multi-agent first | Use realistic game-theory variants (e.g. evolutionary game theory, computational game theory) or develop alternative game theories to describe/predict the collective and individual behaviours of AI ag... | 12 | 10 | — | — | — | SFF, CAIF, Deepmind, Macroscopic Ventures | lewis-hammond, emery-cooper, allan-chan, caspar-oesterheld, vincent-conitzer, vojta-kovarik, nathani... |
| Tools for aligning multiple AIs | Multi-agent first | Develop tools and techniques for designing and testing multi-agent AI scenarios, for auditing real-world multi-agent AI dynamics, and for aligning AIs in multi-AI settings. | 12 | 10-15 | — | — | — | Coefficient Giving, Deepmind, Cooperative AI Foundation | andrew-critch, lewis-hammond, emery-cooper, allan-chan, caspar-oesterheld, vincent-conitzer, gillian... |
| AGI metrics | Evals | Evals with the explicit aim of measuring progress towards full human-level generality. | 5 | 10-50 | — | — | — | Leverhulme Trust, Open Philanthropy, Long-Term Future Fund | cais, cfi-kinds-of-intelligence, apart-research, openai, metr, lexin-zhou, adam-scholl, lorenzo-pacc... |
| AI deception evals | Evals | Research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging. | 13 | 30-80 | Worst Case | — | — | Labs, academic institutions (e.g., Harvard, CMU, Barcelona I... | cadenza, fred-heiding, simon-lermen, andrew-kao, myra-cheng, cinoo-lee, pranav-khadpe, satyapriya-kr... |
| AI scheming evals | Evals | Evaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and complian... | 7 | 30-60 | Pessimistic | — | — | OpenAI, Anthropic, Google DeepMind, Open Philanthropy | bronson-schoen, alexander-meinke, jason-wolfe, mary-phuong, rohin-shah, evgenia-nitishinskaya, mikit... |
| Autonomy evals | Evals | Measure an AI's ability to act autonomously to complete long-horizon, complex tasks. | 13 | 10-50 | Average Case | — | — | The Audacious Project, Open Philanthropy | metr, thomas-kwa, ben-west, joel-becker, beth-barnes, hjalmar-wijk, tao-lin, giulio-starace, oliver-... |
| Capability evals | Evals | Make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better. | 34 | 100+ | Average Case | — | — | basically everyone. Google, Microsoft, Open Philanthropy, LT... | metr, aisi, apollo-research, marius-hobbhahn, meg-tong, mary-phuong, beth-barnes, thomas-kwa, joel-... |
| Other evals | Evals | A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy. | 20 | 20-50 | Average Case | — | — | Lab funders (OpenAI), Open Philanthropy (which funds CAIS, t... | richard-ren, mantas-mazeika, andres-corrada-emmanuel, ariba-khan, stephen-casper |
| Sandbagging evals | Evals | Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context. | 9 | 10-50 | Pessimistic | — | — | Anthropic (and its funders, e.g., Google, Amazon), UK Govern... | teun-van-der-weij, cameron-tice, chloe-li, johannes-gasteiger, joseph-bloom, joel-dyer |
| Self-replication evals | Evals | Evaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves. | 3 | 10-20 | Worst Case | — | — | UK Government (via UK AI Safety Institute) | sid-black, asa-cooper-stickland, jake-pencharz, oliver-sourbut, michael-schmatz, jay-bailey, ollie-m... |
| Situational awareness and self-awareness evals | Evals | Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment. | 11 | 30-70 | Worst Case | — | — | frontier labs (Google DeepMind, Anthropic), Open Philanthrop... | jan-betley, xuchan-bao, martin-soto, mary-phuong, roland-s-zimmermann, joe-needham, giles-edkins, gov... |
| Steganography evals | Evals | Evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring. | 5 | 1-10 | Worst Case | — | — | Anthropic (and its general funders, e.g., Google, Amazon) | antonio-norelli, michael-bronstein |
| Various Redteams | Evals | Attack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods. | 57 | 100+ | Average Case | — | — | Frontier labs (Anthropic, OpenAI, Google), government (UK AI... | ryan-greenblatt, benjamin-wright, aengus-lynch, john-hughes, samuel-r-bowman, andy-zou, nicholas-car... |
| WMD evals (Weapons of Mass Destruction) | Evals | Evaluate whether AI models possess dangerous knowledge or capabilities related to biological and chemical weapons, such as biosecurity or chemical synthesis. | 6 | 10-50 | Pessimistic | — | — | Open Philanthropy, UK AI Safety Institute (AISI), frontier l... | lennart-justen, haochen-zhao, xiangru-tang, ziran-yang, aidan-peppin, anka-reuel, stephen-casper |
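
To make a few of the rows above concrete, here is a minimal sketch of inference-time activation steering (the "Activation engineering" and "Inference-time: Steering" rows). The model, layer index, contrast prompts, and steering scale are illustrative assumptions rather than anything from a cited paper; the point is only the mechanics of adding a direction to a layer's output via a forward hook.

```python
# Minimal activation-steering sketch. Assumptions: GPT-2 as a stand-in model,
# layer 6, a single contrast pair, and a hand-picked scale of 4.0.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

LAYER = 6  # assumption: steer a middle transformer block

def hidden_at_layer(text: str) -> torch.Tensor:
    """Residual-stream activation at LAYER for the last token of `text`."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Build a steering direction from a contrast pair (purely illustrative prompts).
direction = hidden_at_layer("You are a polite assistant.") - hidden_at_layer("You are a rude assistant.")
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # GPT2Block returns a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * direction
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok("The customer asked for a refund and I said", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=30, do_sample=False)[0]))
finally:
    handle.remove()  # detach the hook so later calls are unsteered
```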
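A minimal sparse-autoencoder sketch for the "Sparse Coding" row: decompose activations into a nonnegative, mostly-zero combination of learned feature directions. The width, expansion factor, and L1 coefficient below are placeholder choices, and random noise stands in for a real activation buffer.

```python
# Toy sparse autoencoder over residual-stream activations (placeholder data).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 768 * 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # sparse, nonnegative feature activations
        recon = self.decoder(feats)             # reconstruction of the input activations
        return feats, recon

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 3e-4  # assumption: trades reconstruction quality against sparsity

def train_step(acts: torch.Tensor) -> float:
    """One optimisation step on a batch of activations."""
    feats, recon = sae(acts)
    loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Random tensors stand in for activations collected from a language model.
for _ in range(3):
    print(train_step(torch.randn(256, 768)))
```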
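A minimal linear-probe sketch for the "Monitoring concepts" and "Lie and deception detectors" rows: fit a logistic-regression probe on hidden activations labelled honest vs. deceptive and read off held-out accuracy. The activations here are synthetic placeholders; in practice they would be collected from a model on labelled prompts.

```python
# Toy concept probe on synthetic "activations" separated along one direction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 768, 2000

direction = rng.normal(size=d_model)        # stand-in "deception" direction
labels = rng.integers(0, 2, size=n)         # 1 = deceptive, 0 = honest
acts = rng.normal(size=(n, d_model)) + np.outer(labels, direction)

X_train, X_test, y_train, y_test = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```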
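Finally, a toy illustration of the "Weak-to-strong generalization" setup, with small scikit-learn models standing in for the weak supervisor and the strong student; the agenda itself concerns LLM finetuning, so this only shows the shape of the experiment (weak labels, student trained on them, ceiling comparison), not the method as practised.

```python
# Weak-to-strong toy: a data-starved weak model supervises a larger student.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=40, n_informative=20, random_state=0)
X_sup, X_student, y_sup, y_student = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_student, y_student, test_size=0.3, random_state=0)

weak = LogisticRegression(max_iter=200).fit(X_sup[:300], y_sup[:300])  # weak supervisor: tiny data budget
weak_labels = weak.predict(X_train)                                    # imperfect supervision signal

strong_on_weak = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_train, weak_labels)
strong_ceiling = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_train, y_train)

print("weak supervisor accuracy:    ", weak.score(X_test, y_test))
print("strong trained on weak:      ", strong_on_weak.score(X_test, y_test))
print("strong ceiling (true labels):", strong_ceiling.score(X_test, y_test))
```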