Shallow Review of Technical AI Safety, 2025

Capability removal: unlearning

Developing methods to selectively remove specific information, capabilities, or behaviors from a trained model (i.e. without retraining it from scratch). A mixture of black-box and white-box approaches.
Theory of Change: If an AI learns dangerous knowledge (e.g. dual-use capabilities like virology or hacking, or knowledge of its own safety controls) or exhibits undesirable behaviors (e.g. memorizing private data), we can specifically erase this "bad" knowledge post-training, which is much cheaper and faster than retraining, thereby making the model safer. Alternatively, intervene during pre-training to prevent the model from learning it in the first place (even when data filtering is imperfect). One could also imagine unlearning propensities toward power-seeking, deception, sycophancy, or spite.
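To make the basic idea concrete, here is a minimal toy sketch of one common family of unlearning methods: gradient ascent on a "forget" set combined with continued gradient descent on a "retain" set. This is an illustrative stand-in using a linear model on synthetic data, not the method of any particular paper above; all names and hyperparameters are invented for the example.

```python
# Toy sketch of ascent-based unlearning (illustrative; not any specific paper's method).
# A linear model is trained jointly on a benign "retain" task and a "forget" task;
# unlearning then ascends the forget loss while descending the retain loss, so the
# target mapping degrades while retained behavior is preserved.
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(w, X, y):
    """Mean-squared-error loss of the linear model X @ w, and its gradient in w."""
    err = X @ w - y
    return float(np.mean(err ** 2)), 2.0 * X.T @ err / len(y)

# Synthetic data: a benign retain task and an unrelated forget task.
d = 8
X_retain = rng.normal(size=(64, d))
y_retain = X_retain @ rng.normal(size=d)
X_forget = rng.normal(size=(16, d))
y_forget = X_forget @ rng.normal(size=d)  # the mapping we later want erased

# Phase 1: train jointly on both tasks (the model "learns the bad knowledge").
w = np.zeros(d)
for _ in range(500):
    _, g_r = loss_and_grad(w, X_retain, y_retain)
    _, g_f = loss_and_grad(w, X_forget, y_forget)
    w -= 0.01 * (g_r + g_f)

forget_before, _ = loss_and_grad(w, X_forget, y_forget)

# Phase 2: unlearn by ascending the forget loss while still descending the
# retain loss; the small ascent coefficient keeps the update stable.
for _ in range(300):
    _, g_r = loss_and_grad(w, X_retain, y_retain)
    _, g_f = loss_and_grad(w, X_forget, y_forget)
    w -= 0.01 * (g_r - 0.1 * g_f)  # sign flip on the forget-set gradient

retain_after, _ = loss_and_grad(w, X_retain, y_retain)
forget_after, _ = loss_and_grad(w, X_forget, y_forget)
print(f"forget loss: {forget_before:.3f} -> {forget_after:.3f}; "
      f"retain loss after unlearning: {retain_after:.3f}")
```

The forget loss rises while the retain loss stays low. Much of the literature below is about why this simple recipe is insufficient for LLMs: the "erased" knowledge is often recoverable by fine-tuning or tampering attacks, motivating the tamper-resistance, distillation, and localization work listed in the outputs.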
Target Case: Pessimistic
Some names: Rowan Wang, Avery Griffin, Johannes Treutlein, Zico Kolter, Bruce W. Lee, Addie Foote, Alex Infanger, Zesheng Shi, Yucheng Zhou, Jing Li, Timothy Qian, Stephen Casper, Alex Cloud
Estimated FTEs: 10-50
Outputs:
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics, by Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C Lipton, J Zico Kolter, Pratyush Maini
Modifying LLM Beliefs with Synthetic Document Finetuning, by Rowan Wang, Avery Griffin, Johannes Treutlein, Ethan Perez, Julian Michael, Fabien Roger, Sam Marks
From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization, by Shoaib Ahmed Siddiqui, Adrian Weller, David Krueger, Gintare Karolina Dziugaite, Michael Curtis Mozer, Eleni Triantafillou
Mirror Mirror on the Wall, Have I Forgotten it All? A New Framework for Evaluating Machine Unlearning, by Brennon Brimhall, Philip Mathew, Neil Fendley, Yinzhi Cao, Matthew Green
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research, by A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Kevin Klyman, Matthew Jagielski, Katja Filippova, Ken Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Eleni Triantafillou, Peter Kairouz, Nicole Elyse Mitchell, Niloofar Mireshghallah, Abigail Z. Jacobs, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Ilia Shumailov, Andreas Terzis, Solon Barocas, Jennifer Wortman Vaughan, Danah Boyd, Yejin Choi, Sanmi Koyejo, Fernando Delgado, Percy Liang, Daniel E. Ho, Pamela Samuelson, Miles Brundage, David Bau, Seth Neel, Hanna Wallach, Amy B. Cyphert, Mark A. Lemley, Nicolas Papernot, Katherine Lee
Open Problems in Machine Unlearning for AI Safety, by Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, Yarin Gal
Safety Alignment via Constrained Knowledge Unlearning, by Zesheng Shi, Yucheng Zhou, Jing Li
Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization, by Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs, by Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, Haibo Hu
Unlearning Needs to be More Selective [Progress Report], by Filip Sondej, Yushi Yang, Marcel Windys
Layered Unlearning for Adversarial Relearning, by Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell
Understanding Memorization via Loss Curvature, by Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities, by Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks, by Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, Alexander Matt Turner
Distillation Robustifies Unlearning, by Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, Alexander Matt Turner
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs, by Igor Shilov, Alex Cloud, Aryo Pradipta Gema, Jacob Goldman-Wetzler, Nina Panickssery, Henry Sleight, Erik Jones, Cem Anil