Methodology
Structure, sources, processing methods, and related reviews
Structure
We have again settled on a tree data structure for this post – but people and work can appear in multiple nodes, so it’s not a strict partition. Richer representations may be in the works.
The level of analysis for each node in the tree is the “research agenda”, an abstraction spanning multiple papers and organisations in a messy many-to-many relation. What makes something an agenda? Similar methods, similar aims, or something sociological about leaders and collaborators. Agendas vary greatly in their degree of coherent agency, from the very coherent Anthropic Circuits work to the enormous, leaderless and unselfconscious “iterative alignment”.
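To make the structure concrete, here is a minimal sketch (ours, purely illustrative, not the actual data schema): a tree of areas whose leaves are agendas, with papers and researchers allowed to attach to more than one agenda.

```python
# Minimal illustrative sketch of the review's structure: a tree of areas whose leaves
# are agendas, with papers and researchers in a many-to-many relation to agendas.
# Names and fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Paper:
    title: str
    url: str

@dataclass
class Agenda:
    name: str
    papers: list[Paper] = field(default_factory=list)     # a paper can sit in several agendas
    researchers: list[str] = field(default_factory=list)  # so can a researcher

@dataclass
class Area:
    name: str                                             # e.g. "white-box" or "black-box"
    children: list["Area"] = field(default_factory=list)
    agendas: list[Agenda] = field(default_factory=list)

circuits = Agenda("Circuits-style interpretability")
root = Area("Technical AGI safety", children=[
    Area("White-box", agendas=[circuits]),
    Area("Black-box"),
])
```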
Scope
Time period: 30th November 2024 – 30th November 2025 (with a few exceptions).
We’re focussing on “technical AGI safety”. We thus ignore a lot of work relevant to the overall risk: misuse, policy, strategy, OSINT, resilience and indirect risk, AI rights, general capabilities evals, and things closer to “technical policy” and its products (standards, legislation, SL4 datacentres, and automated cybersecurity). We also mostly focus on papers and blogposts (rather than, say, underground gdoc samizdat or Discords).
- We only use public information, so our coverage is off by some additional unknown factor.
- We try to include things that are early-stage and illegible – but in general we fail, and mostly capture legible work on legible problems (i.e. things you can already write a paper on).
- Of the 2000+ links to papers, organizations and posts in the raw scrape, about 700 made it in.
Paper Sources
- All arXiv papers with "AI alignment", "AI safety", or "steerability" in the abstract or title (see the sketch after this list); all papers by ~120 AI safety researchers
- All Alignment Forum posts and all LW posts under "AI"
- Gasteiger’s links, Paleka’s links, Lenz’s links, Zvi’s links
- Ad hoc Twitter collection over the year, plus several conference pages and workshops
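As an illustration of the arXiv bullet above, something like the following reproduces the keyword part of the scrape against the public arXiv Atom API. This is a sketch under our own assumptions, not the actual scraper (which is in the code linked below).

```python
# Rough sketch of the arXiv keyword scrape via the public arXiv Atom API.
# Illustrative only; parameters and post-processing differ in the real pipeline.
import feedparser
import requests

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_keyword_search(phrase: str, max_results: int = 200) -> list[dict]:
    """Papers with `phrase` in the title or abstract, newest first."""
    params = {
        "search_query": f'ti:"{phrase}" OR abs:"{phrase}"',
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    feed = feedparser.parse(requests.get(ARXIV_API, params=params, timeout=30).text)
    return [{"title": e.title, "abstract": e.summary, "link": e.link} for e in feed.entries]

candidates = []
for phrase in ["AI alignment", "AI safety", "steerability"]:
    candidates.extend(arxiv_keyword_search(phrase))
```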
Note on AI scrapers: AI scrapes miss lots of things. We did a proper pass with a software scraper over the 3000+ links collected from the above and an LLM crawl of some of the pages, and then an LLM pass to pre-filter the links for relevance and pre-assign them to agendas and areas, but this also had systematic omissions. We ended up doing a full manual pass over the conservatively LLM-pre-filtered links, re-classifying the links and papers. The code and data can be found here, including the 3300 collected candidate links. We are not aware of any proper studies of “LLM laziness”, but it’s well known amongst power users of copilots.
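To give a flavour of the pre-filter step (not our exact prompts, agenda list, or model; `call_llm` is a stand-in for whatever completion API is used):

```python
# Shape of the LLM pre-filter / pre-classification pass described above.
# Prompt, agenda list and `call_llm` are placeholders, not the actual pipeline.
import json
from typing import Callable

AGENDAS = ["interpretability", "evals", "control"]  # abridged, illustrative

PROMPT = """You are pre-filtering links for a review of technical AI safety research.
Given a title and abstract, reply with JSON only:
{{"relevant": true or false, "agenda": one of {agendas} or null}}

Title: {title}
Abstract: {abstract}"""

def prefilter(links: list[dict], call_llm: Callable[[str], str]) -> list[dict]:
    kept = []
    for link in links:
        reply = call_llm(PROMPT.format(agendas=AGENDAS, title=link["title"],
                                       abstract=link.get("abstract", "")))
        try:
            verdict = json.loads(reply)
        except json.JSONDecodeError:
            verdict = {"relevant": True, "agenda": None}  # fail open: keep for manual review
        if verdict.get("relevant"):
            kept.append({**link, "agenda_guess": verdict.get("agenda")})
    return kept
```

In this sketch, “conservative” means failing open: anything ambiguous stays in for the manual pass.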
For finding critiques we used LW backlinks, Google Scholar cited-by, manual search, collected links, and Ahrefs. Technical critiques are somewhat rare, though, and even then our coverage here is likely severely lacking. We generally do not include social network critiques (mostly due to scope).
Despite this effort we will not have included all relevant papers and names. The omission of a paper or a researcher is not strong negative evidence about their relevance.
Processing
- Collecting links throughout the year and at project start. Skimming papers, staring at long lists.
- We drafted a taxonomy of research agendas, based on last year's list, our expertise, and the initial paper collection, though we changed the structure to accommodate shifts in the domain: the top-level split is now “black-box” vs “white-box” instead of “control” vs “understanding”.
- At around 300 manually collected links (and growing fast), we decided to implement simple pipelines for crawling and scraping, and for LLM metadata extraction, pre-filtering, and pre-classification into agendas, as well as other tasks, including final formatting later on. The use of LLMs was limited to one simple task at a time, and the results were closely watched and reviewed. Code and data here.
- We tried getting the AI to update our taxonomy bottom-up to fit the body of papers, but the results weren’t very good, though we are looking at some more advanced options (specialized embeddings or feature extraction, plus clustering or t-SNE; see the sketch after this list).
- Work on the ~70 agendas was distributed among the team. We ended up making many local changes to the taxonomy, esp. splitting up and merging agendas. The taxonomy is specific to this year, and will need to be adapted in coming years.
- We moved any agendas without public outputs this year to the bottom, and the inactive ones to the Graveyard. For most of them, we checked with people from the agendas for outputs or work we may have missed.
- What started as a brief summary editorial grew into its own thing (6000w).
- We asked 10 friends in AI safety to review the ~80 page draft. After editing and formatting, we asked 50 technical AI safety researchers for a quick review focused on their expertise.
- The field is growing at around 20% a year. There will come a time when this list won't be sensible to compile manually, even with the help of LLMs (at this granularity anyway). We may come up with better alternatives to lists and posts by then, though.
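The bottom-up option mentioned above (the sketch promised in the list) might look something like this: embed each paper's title and abstract, cluster, and project to 2D for eyeballing against the hand-made taxonomy. The model name and parameters are placeholders, not what we ran.

```python
# Hypothetical sketch of bottom-up taxonomy exploration: embed paper abstracts,
# cluster them, and project to 2D for manual inspection. Not the pipeline we shipped.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def explore_clusters(papers: list[dict], n_clusters: int = 70):
    texts = [p["title"] + ". " + p.get("abstract", "") for p in papers]
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)  # placeholder model
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    return labels, coords  # colour coords by labels and compare against the manual taxonomy
```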
Taxonomy Classification
For each agenda we added our best guess about which of Davidad’s alignment problems it would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3.
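Concretely, each agenda entry carries a couple of extra tags along these lines (field names and values are illustrative, not our exact schema):

```python
# Illustrative tag schema for one agenda entry; values are placeholders, not real judgments.
agenda_tags = {
    "name": "Example agenda",
    "davidad_problem": "...",  # which of Davidad's alignment problems it would help with, if it succeeded
    "ngo_3x3": {"research_approach": "...", "implied_optimism": "..."},  # position in Richard Ngo's 3x3
}
```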
Other Reviews and Taxonomies
This review exists in the context of many other efforts to map AI safety research:
- aisafety.com org cards
- nonprofits.zone
- Leong and Linsefors
- Coefficient Giving RFP
- Peregrine Report
- The Singapore Consensus on Global AI Safety Research Priorities
- International AI Safety Report 2025 (and updates)
- A Comprehensive Survey in LLM(-Agent) Full Stack Safety
- plex's Review of AI safety funders
- The Alignment Project
- AI Awareness literature review
- aisafety.com self-study
- Zach Stein-Perlman’s list
- IAPS
- AI Safety Camp 10 Outputs
- The Road to Artificial SuperIntelligence
- AE Studio field guide
- AI Alignment: A Contemporary Survey
Major changes from 2024
- A few major changes to the taxonomy: the top-level split is now “black-box” vs “white-box” instead of “control” vs “understanding”. (We did try out an automated clustering but it wasn’t very good.)
- The agendas are in general less charisma-based and more about solution type.
- We did a systematic arXiv scrape on the word “alignment” (and filtered out the sequence-alignment papers that fell into this pipeline). “Steerability” is one competing term used by academics.
- We scraped >3000 links (arXiv, LessWrong, several alignment publication lists, blogs and conferences), conservatively filtering and pre-categorizing them with an LLM pipeline. All links were curated later, and many more were added manually.
- This review has ~800 links compared to ~300 in 2024 and ~200 in 2023. We looked harder and the field has grown.
- We no longer collate public funding figures.
- New sections: “Labs”, “Multi-agent First”, “Better data”, “Model specs”, “character training”, and “representation geometry”. “Evals” is so massive it gets a top-level section.
Orgs without public outputs this year
We are not aware of public technical AI safety outputs this year from these agendas and organizations, though they are active otherwise.
Graveyard (known to be inactive)
See also: About this review