Shallow Review of Technical AI Safety, 2025
Name
Section
Summary
Papers
FTEs
Target Case
Approaches
Problems
Funded By
Names
AnthropicLabs21
Amazon, Google, ICONIQ, Fidelity, Lightspeed, Altimeter, Bai...chris-olah, evan-hubinger, sam-marks, johannes-treutlein, sam-bowman, euan-ong, fabien-roger, adam-j...
ChinaLabs0
Google DeepmindLabs14
Google. Explicit 2024 Deepmind spending as a whole was [£1.3...rohin-shah, allan-dafoe, anca-dragan, alex-irpan, alex-turner, anna-wang, arthur-conmy, david-lindne...
MetaLabs6
Metashuchao-bi, hongyuan-zhan, jingyu-zhang, haozhu-wang, eric-michael-smith, sid-wang, amr-sharaf, mahe...
OpenAILabs12
Microsoft, [AWS](https://www.aboutamazon.com/news/aws/aws-op...johannes-heidecke, boaz-barak, mia-glaese, jenny-nitishinskaya, lama-ahmad, naomi-bashkansky, miles-...
xAILabs0
A16Z, Blackrock, Fidelity, Kingdom, Lightspeed, MGX, Morgan...dan-hendrycks-advisor, juntang-zhuang, toby-pohlen, lianmin-zheng, piaoyang-cui, nikita-popov, ying-...
Assistance games, assistive agentsBlack-box safetyFormalize how AI assistants learn about human preferences given uncertainty and partial observability, and construct environments which better incentivize AIs to learn what we want them to learn.5
Future of Life Institute, Coefficient Giving, Survival and F...joar-skalse, anca-dragan, caspar-oesterheld, david-krueger, dylan-hafield-menell, stuart-russell
Black-box make-AI-solve-itBlack-box safetyFocus on using existing models to improve and align further models.12Average Case
most of the industryjacques-thibodeau, matthew-shingle, nora-belrose, lewis-hammond, geoffrey-irving
Capability removal: unlearningBlack-box safetyDeveloping methods to selectively remove specific information, capabilities, or behaviors from a trained model (e.g. without retraining it from scratch). A mixture of black-box and white-box approache...1810-50Pessimistic
Coefficient Giving, MacArthur Foundation, UK AI Safety Insti...rowan-wang, avery-griffin, johannes-treutlein, zico-kolter, bruce-w-lee, addie-foote, alex-infanger,...
Chain of thought monitoringBlack-box safetySupervise an AI's natural-language (output) "reasoning" to detect misalignment, scheming, or deception, rather than studying the actual internal states.1710-100Average CaseOpenAI, Anthropic, Google DeepMindaether, bowen-baker, joost-huizinga, leo-gao, scott-emmons, erik-jenner, yanda-chen, james-chua, owa...
Character training and persona steeringBlack-box safetyMap, shape, and control the personae of language models, such that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behavio...13Average CaseAnthropic, Coefficient Givingtruthful-ai, openai, anthropic, clr, amanda-askell, jack-lindsey, janus, theia-vogel, sharan-maiya,...
ControlBlack-box safetyIf we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols to extract useful work from them while preventing sabotage, and watch...225-50Worst Case
redwood, uk-aisi, deepmind, openai, anthropic, buck-shlegeris, ryan-greenblatt, kshitij-sachan, alex...
Data filteringBlack-box safetyBuilds safety into models from the start by removing harmful or toxic content (like dual-use info) from the pretraining data, rather than relying only on post-training alignment.410-50Average CaseAnthropic, various academicsyanda-chen, pratyush-maini, kyle-obrien, stephen-casper, simon-pepin-lehalleur, jesse-hoogland, hima...
Data poisoning defenseBlack-box safetyDevelops methods to detect and prevent malicious or backdoor-inducing samples from being included in the training data.35-20PessimisticGoogle DeepMind, Anthropic, University of Cambridge, Vector...alexandra-souly, javier-rando, ed-chapman, hanna-foerster, ilia-shumailov, yiren-zhao
Data quality for alignmentBlack-box safetyImproves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data.520-50Average CaseAnthropic, Google DeepMind, OpenAI, Meta AI, various academi...maarten-buyl, kelsey-kraus, margaret-kroll, danqing-shi
Emergent misalignmentBlack-box safetyFine-tuning LLMs on one narrow antisocial task can cause general misalignment including deception, shutdown resistance, harmful advice, and extremist sympathies, when those behaviors are never trained...1710-50PessimisticCoefficient Giving, >$1 milliontruthful-ai, jan-betley, james-chua, mia-taylor, miles-wang, edward-turner, anna-soligo, alex-cloud,...
Harm reduction for open weightsBlack-box safetyDevelops methods, primarily based on pretraining data intervention, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to remove safety features or...510-100Average CaseUK AI Safety Institute (AISI), EleutherAI, Coefficient Givin...kyle-obrien, stephen-casper, quentin-anthony, tomek-korbak, rishub-tamirisa, mantas-mazeika, stella-...
Hyperstition studiesBlack-box safetyStudy, steer, and intervene on the following feedback loop: "we produce stories about how present and future AI systems behave" → "these stories become training data for the AI" → "these stories shape...41-10Average CaseUnclear, nichealex-turner, hyperstition-aihttpswwwhyperstitionaicom, kyle-obrien
Inference-time: In-context learningBlack-box safetyInvestigate what runtime guidelines, rules, or examples provided to an LLM yield better behavior.5Average Case
jacob-steinhardt, kayo-yin, atticus-geiger
Inference-time: SteeringBlack-box safetyManipulate an LLM's internal representations/token probabilities without touching weights.4Average Case
taylor-sorensen, constanza-fierro, kshitish-ghate, arthur-vogels
Inoculation promptingBlack-box safetyPrompt mild misbehaviour in training, to prevent the failure mode where once AI misbehaves in a mild way, it will be more inclined towards all bad behaviour.4Average Case
most of the industryariana-azarbal, daniel-tan, victor-gillioz, alex-turner, alex-cloud, monte-macdiarmid, daniel-ziegle...
Iterative alignment at post-train-timeBlack-box safetyModify weights after pre-training.16Average Case
most of the industryadam-gleave, anca-dragan, jacob-steinhardt, rohin-shah
Iterative alignment at pretrain-timeBlack-box safetyGuide weights during pretraining.2Average Case
most of the industryjan-leike, stuart-armstrong, cyrus-cousins, oliver-daniels
Mild optimisationBlack-box safetyAvoid Goodharting by getting AI to satisfice rather than maximise.410-50Google DeepMind
Model psychopathologyBlack-box safetyFind interesting LLM phenomena like glitch [tokens](https://vgel.me/posts/seahorse/) and the reversal curse; these are vital data for theory.95-20Pessimistic
Coefficient Giving (via Truthful AI and Interpretability gra...janus, truthful-ai, theia-vogel, stewart-slocum, nell-watson, samuel-g-b-johnson, liwei-jiang, monik...
Model specs and constitutionsBlack-box safetyWrite detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.11Average Casemajor funders include Anthropic and OpenAI (internally)amanda-askell, joe-carlsmith
Model values / model preferencesBlack-box safetyAnalyse and control emergent, coherent value systems in LLMs, which change as models scale, and can contain problematic values like preferences for AIs over humans.1430PessimisticCoefficient Giving. $289,000 SFF funding for CAIS.mantas-mazeika, xuwang-yin, rishub-tamirisa, jaehyuk-lim, bruce-w-lee, richard-ren, long-phan, norma...
RL safetyBlack-box safetyImproves the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.1120-70PessimisticGoogle DeepMind, University of Oxford, CMU, Coefficient Givi...joar-skalse, karim-abdel-sadek, matthew-farrugia-roberts, benjamin-plaut, fang-wu, stephen-zhao, ale...
Safeguards (inference-time auxiliaries)Black-box safetyLayers of inference-time defenses, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviors.6100+Average Casemost of the big labsmrinank-sharma, meg-tong, jesse-mu, alwin-peng, julian-michael, henry-sleight, theodore-sumers, raj-...
Synthetic data for alignmentBlack-box safetyUses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models.850-150Average CaseAnthropic, Google DeepMind, OpenAI, Meta AI, various academi...mianqiu-huang, xiaoran-liu, rylan-schaeffer, nevan-wichers, aram-ebtekar, jiaxin-wen, vishakh-padmak...
The "Neglected Approaches" ApproachBlack-box safetyAgenda-agnostic approaches to identifying good but overlooked empirical alignment ideas, working with theorists who could use engineers, and prototyping them.315Average CaseAE Studioae-studio, gunnar-zarncke, cameron-berg, michael-vaiana, judd-rosenblatt, diogo-schwerz-de-lucena
Activation engineeringWhite-box safetyProgrammatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.1520-100Average Case
Coefficient Giving, Anthropicrunjin-chen, andy-arditi, david-krueger, jan-wehner, narmeen-oozeer, reza-bayat, adam-karvonen, jiud...
Causal AbstractionsWhite-box safetyVerify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.310-30Worst CaseVarious academic groups, Google DeepMind, Goodfireatticus-geiger, christopher-potts, thomas-icard, theodora-mara-pslar, sara-magliacane, jiuding-sun,...
Data attributionWhite-box safetyQuantifies the influence of individual training data points on a model's specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back t...1230-60Average CaseVarious academic groupsroger-grosse, philipp-alexander-kreer, jin-hwa-lee, matthew-smith, abhilasha-ravichander, andrew-wan...
Extracting latent knowledgeWhite-box safetyIdentify and decoding the "true" beliefs or knowledge represented inside a model's activations, even when the model's output is deceptive or false.920-40Worst CaseOpen Philanthropy, Anthropic, NSF, various academic grantsbartosz-cywiski, emil-ryd, senthooran-rajamanoharan, alexander-pan, lijie-chen, jacob-steinhardt, ja...
Human inductive biasesWhite-box safetyDiscover connections deep learning AI systems have with human brains and human learning processes. Develop an 'alignment moonshot' based on a coherent theory of learning which applies to both humans a...64PessimisticGoogle DeepMind, various academic groupslukas-muttenthaler, quentin-delfosse
Learning dynamics and developmental interpretabilityWhite-box safetyBuilds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model's training and in-context...1410-50Worst CaseManifund, Survival and Flourishing Fund, EA Fundstimaeus, jesse-hoogland, george-wang, daniel-murfet, stan-van-wingerden, alexander-gietelink-oldenzi...
Lie and deception detectorsWhite-box safetyDetect when a model is being deceptive or lying by building white- or black-box detectors. Some work below requires intent in their definition, while other work focuses only on whether the model state...1110-50Pessimistic
Anthropic, Deepmind, UK AISI, Coefficient Givingcadenza, sam-marks, rowan-wang, kieron-kretschmar, sharan-maiya, walter-laurito, chris-cundy, adam-g...
Model diffingWhite-box safetyUnderstand what happens when a model is finetuned, what the "diff" between the finetuned and the original model consists in.910-30Pessimisticvarious academic groups, Anthropic, Google DeepMindjulian-minder, clment-dumas, neel-nanda, trenton-bricken, jack-lindsey
Monitoring conceptsWhite-box safetyIdentifies directions or subspaces in a model's latent state that correspond to high-level concepts (like refusal, deception, or planning) and uses them to audit models for misalignment, monitor them...1150-100PessimisticCoefficient Giving, Anthropic, various academic groupsdaniel-beaglehole, adityanarayanan-radhakrishnan, enric-boix-adser, tom-wollschlger, anna-soligo, ja...
Other interpretabilityWhite-box safetyInterpretability that does not fall well into other categories.1930-60
lee-sharkey, dario-amodei, david-chalmers, been-kim, neel-nanda, david-d-baek, lauren-greenspan, dmi...
Pragmatic interpretabilityWhite-box safetyDirectly tackling concrete, safety-critical problems on the path to AGI by using lightweight interpretability tools (like steering and probing) and empirical feedback from proxy tasks, rather than pur...330-60Google DeepMind, Anthropic, various academic groupslee-sharkey, dario-amodei, david-chalmers, been-kim, neel-nanda, david-d-baek, lauren-greenspan, dmi...
Representation structure and geometryWhite-box safetyWhat do the representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we get the semantics from this geometry?1310-50Various academic groups, Astera Institute, Coefficient Givin...simplex, insight-interaction-lab, paul-riechers, adam-shai, martin-wattenberg, blake-richards, mateu...
Reverse engineeringWhite-box safetyDecompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model's...33100-200Worst Caselucius-bushnaq, dan-braun, lee-sharkey, aaron-mueller, atticus-geiger, sheridan-feucht, david-bau, y...
Sparse CodingWhite-box safetyDecompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic "features" which correspond to interpretable concepts.4450-100Average Case
everyone, roughly. Frontier labs, LTFF, Coefficient Giving,...leo-gao, dan-mossing, emmanuel-ameisen, jack-lindsey, adam-pearce, thomas-heap, abhinav-menon, kenny...
Brainlike-AGI SafetySafety by constructionSocial and moral instincts are (partly) implemented in particular hardwired brain circuitry; let's figure out what those circuits are and how they work; this will involve symbol grounding. "a yet-to-b...61-5Worst Case
Astera Institutesteve-byrnes
Guaranteed-Safe AISafety by constructionHave an AI system generate outputs (e.g. code, control systems, or RL policies) which it can quantitatively guarantee comply with a formal safety specification and world model.510-100Worst Case
Manifund, ARIA, Coefficient Giving, Survival and Flourishing...aria, lawzero, atlas-computing, flf, max-tegmark, beneficial-ai-foundation, steve-omohundro, david-d...
Scientist AISafety by constructionDevelop powerful, nonagentic, uncertain world models that accelerate scientific progress while avoiding the risks of agent AIs21-10PessimisticARIA, Gates Foundation, Future of Life Institute, Coefficien...yoshua-bengio, younesse-kaddar
AI explanations of AIsMake AI solve itMake open AI tools to explain AIs, including AI agents. e.g. automatic feature descriptions for neuron activation patterns; an interface for steering these features; a behaviour elicitation agent that...515-30PessimisticSchmidt Sciences, Halcyon Futures, John Schulman, Wojciech Z...transluce, jacob-steinhardt, neil-chowdhury, vincent-huang, sarah-schwettmann, robert-friel
DebateMake AI solve itIn the limit, it's easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.6Worst Case
Google, othersrohin-shah, jonah-brown-cohen, georgios-piliouras, uk-aisi-benjamin-holton
LLM introspection trainingMake AI solve itTrain LLMs to the predict the outputs of high-quality whitebox methods, to induce general self-explanation skills that use its own 'introspective' access22-20Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Z...belinda-z-li, zifan-carl-guo, vincent-huang, jacob-steinhardt, jacob-andreas, jack-lindsey
Supervising AIs improving AIsMake AI solve itBuild formal and empirical frameworks where AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools which enable scalable tracking of behavioural drift, bench...81-10PessimisticLong-Term Future Fund, lab fundersroman-engeler, akbir-khan, ethan-perez
Weak-to-strong generalizationMake AI solve itUse weaker models to supervise and provide a feedback signal to stronger models.42-20Average Caselab funders, Eleuther fundersjoshua-engels, nora-belrose, david-d-baek
Agent foundationsTheoryDevelop philosophical clarity and mathematical formalizations of building blocks that might be useful for plans to align strong superintelligence, such as agency, optimization strength, decision theor...10Worst Caseabram-demski, alex-altair, sam-eisenstat, thane-ruthenis, alfred-harwood, daniel-c, dalcy-k, jos-ped...
Asymptotic guaranteesTheoryProve that if a safety process has enough resources (human data quality, training time, neural network capacity), then in the limit some system specification will be guaranteed. Use complexity theory,...45 - 10PessimisticAISIaisi, jacob-pfau, benjamin-hilton, geoffrey-irving, simon-marshall, will-kirby, martin-soto, david-a...
Behavior alignment theoryTheoryPredict properties of future AGI (e.g. power-seeking) with formal models; formally state and prove hypotheses about the properties powerful systems will have and how we might try to change them.101-10Worst Case
ram-potham, michael-k-cohen, max-harmsraelifin, john-wentworth, david-lorell, elliott-thornley
Heuristic explanationsTheoryFormalize mechanistic explanations of neural network behavior, automate the discovery of these "heuristic explanations" and use them to predict when novel input will lead to extreme behavior (i.e. "Lo...51-10Worst Case
jacob-hilton, mark-xu, eric-neyman, victor-lecomte, george-robinson
High-Actuation SpacesTheoryMech interp and alignment assume a stable "computational substrate" (linear algebra on GPUs). If later AI uses different substrates (e.g. something neuromorphic), methods like probes and steering will...71-10Pessimistic
sahil-k, matt-farr, aditya-arpitha-prasad, chris-pang, aditya-adiga, jayson-amati, steve-petersen, t...
Natural abstractionsTheoryDevelop a theory of concepts that explains how they are learned, how they structure a particular system's understanding, and how mutual translatability can be achieved between different collections of...101-10Worst Casejohn-wentworth, paul-colognese, david-lorrell, sam-eisenstat, fernando-rosas
Other corrigibilityTheoryDiagnose and communicate obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behaviors91-10Pessimistic
jeremy-gillen
The Learning-Theoretic AgendaTheoryCreate a mathematical theory of intelligent agents that encompasses both humans and the AIs we want, one that specifies what it means for two such agents to be aligned; translate between its ontology...63Worst CaseSurvival and Flourishing Fund, ARIA, UK AISI, Coefficient Gi...vanessa-kosoy, diffractor, gergely-szcs
Tiling agentsTheoryAn aligned agentic system modifying itself into an unaligned system would be bad and we can research ways that this could occur and infrastructure/approaches that prevent it from happening.41-10Worst Caseabram-demski
Aligned to who?Multi-agent firstTechnical protocols for taking seriously the plurality of human values, cultures, and communities when aligning AI to "humanity"95 - 15Average CaseFuture of Life Institute, Survival and Flourishing Fund, Dee...joel-z-leibo, divya-siddarth, sb-krier, luke-thorburn, seth-lazar, ai-objectives-institute, the-coll...
Aligning to contextMulti-agent firstAlign AI directly to the role of participant, collaborator, or advisor for our best real human practices and institutions, instead of aligning AI to separately representable goals, rules, or utility f...85ARIA, OpenAI, Survival and Flourishing Fundfull-stack-alignment, meaning-alignment-institute, plurality-institute, tan-zhi-xuan, matija-frankli...
Aligning to the social contractMulti-agent firstGenerate AIs' operational values from 'social contract'-style ideal civic deliberation formalisms and their consequent rulesets for civic actors85 - 10Deepmind, Macroscopic Venturesgillian-hadfield, tan-zhi-xuan, sydney-levine, matija-franklin, joshua-b-tenenbaum
Aligning what?Multi-agent firstDevelop alternatives to agent-level models of alignment, by treating human-AI interactions, AI-assisted institutions, AI economic or cultural systems, drives within one AI, and other causal/constituti...135-10
Future of Life Institute, Emmett Shearrichard-ngo, emmett-shear, softmax, full-stack-alignment, ai-objectives-institute, sahil, tj, andrew...
Theory for aligning multiple AIsMulti-agent firstUse realistic game-theory variants (e.g. evolutionary game theory, computational game theory) or develop alternative game theories to describe/predict the collective and individual behaviours of AI ag...1210SFF, CAIF, Deepmind, Macroscopic Ventureslewis-hammond, emery-cooper, allan-chan, caspar-oesterheld, vincent-conitzer, vojta-kovarik, nathani...
Tools for aligning multiple AIsMulti-agent firstDevelop tools and techniques for designing and testing multi-agent AI scenarios, for auditing real-world multi-agent AI dynamics, and for aligning AIs in multi-AI settings.1210 - 15
Coefficient Giving, Deepmind, Cooperative AI Foundationandrew-critch, lewis-hammond, emery-cooper, allan-chan, caspar-oesterheld, vincent-conitzer, gillian...
AGI metricsEvalsEvals with the explicit aim of measuring progress towards full human-level generality.510-50
Leverhulme Trust, Open Philanthropy, Long-Term Future Fundcais, cfi-kinds-of-intelligence, apart-research, openai, metr, lexin-zhou, adam-scholl, lorenzo-pacc...
AI deception evalsEvalsresearch demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.1330-80Worst Case
Labs, academic institutions (e.g., Harvard, CMU, Barcelona I...cadenza, fred-heiding, simon-lermen, andrew-kao, myra-cheng, cinoo-lee, pranav-khadpe, satyapriya-kr...
AI scheming evalsEvalsEvaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and complian...730-60Pessimistic
OpenAI, Anthropic, Google DeepMind, Open Philanthropybronson-schoen, alexander-meinke, jason-wolfe, mary-phuong, rohin-shah, evgenia-nitishinskaya, mikit...
Autonomy evalsEvalsMeasure an AI's ability to act autonomously to complete long-horizon, complex tasks.1310-50Average Case
The Audacious Project, Open Philanthropymetr, thomas-kwa, ben-west, joel-becker, beth-barnes, hjalmar-wijk, tao-lin, giulio-starace, oliver-...
Capability evalsEvalsMake tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.34100+Average Case
basically everyone. Google, Microsoft, Open Philanthropy, LT...metr, aisi, apollo-research, marrius-hobbhahn, meg-tong, mary-phuong, beth-barnes, thomas-kwa, joel-...
Other evalsEvalsA collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.2020-50Average Case
Lab funders (OpenAI), Open Philanthropy (which funds CAIS, t...richard-ren, mantas-mazeika, andrs-corrada-emmanuel, ariba-khan, stephen-casper
Sandbagging evalsEvalsEvaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.910-50PessimisticAnthropic (and its funders, e.g., Google, Amazon), UK Govern...teun-van-der-weij, cameron-tice, chloe-li, johannes-gasteiger, joseph-bloom, joel-dyer
Self-replication evalsEvalsevaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves.310-20Worst CaseUK Government (via UK AI Safety Institute)sid-black, asa-cooper-stickland, jake-pencharz, oliver-sourbut, michael-schmatz, jay-bailey, ollie-m...
Situational awareness and self-awareness evalsEvalsEvaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.1130-70Worst Casefrontier labs (Google DeepMind, Anthropic), Open Philanthrop...jan-betley, xuchan-bao, martn-soto, mary-phuong, roland-s-zimmermann, joe-needham, giles-edkins, gov...
Steganography evalsEvalsevaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.51-10Worst CaseAnthropic (and its general funders, e.g., Google, Amazon)antonio-norelli, michael-bronstein
Various RedteamsEvalsattack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods.57100+Average CaseFrontier labs (Anthropic, OpenAI, Google), government (UK AI...ryan-greenblatt, benjamin-wright, aengus-lynch, john-hughes, samuel-r-bowman, andy-zou, nicholas-car...
WMD evals (Weapons of Mass Destruction)EvalsEvaluate whether AI models possess dangerous knowledge or capabilities related to biological and chemical weapons, such as biosecurity or chemical synthesis.610-50Pessimistic
Open Philanthropy, UK AI Safety Institute (AISI), frontier l...lennart-justen, haochen-zhao, xiangru-tang, ziran-yang, aidan-peppin, anka-reuel, stephen-casper