Shallow Review of Technical AI Safety, 2025

Capability removal: unlearning

Developing methods to selectively remove specific information, capabilities, or behaviors from a trained model (i.e. without retraining it from scratch). A mixture of black-box and white-box approaches.
Theory of Change: If an AI learns dangerous knowledge (e.g. dual-use capabilities like virology or hacking, or knowledge of its own safety controls) or exhibits undesirable behaviors (e.g. memorizing private data), we can specifically erase this "bad" knowledge post-training, which is much cheaper and faster than retraining, thereby making the model safer. Alternatively, intervene during pre-training to prevent the model from learning it in the first place (even when data filtering is imperfect). One could also imagine unlearning propensities toward power-seeking, deception, sycophancy, or spite.
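To make the basic idea concrete, here is a minimal toy sketch of one common family of unlearning methods: gradient ascent on a "forget" set combined with continued gradient descent on a "retain" set. This is an illustrative stand-in using a linear model on synthetic data, not the method of any particular paper above; all names and hyperparameters are invented for the example.

```python
# Toy sketch of ascent-based unlearning (illustrative; not any specific paper's method).
# A linear model is trained jointly on a benign "retain" task and a "forget" task;
# unlearning then ascends the forget loss while descending the retain loss, so the
# target mapping degrades while retained behavior is preserved.
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(w, X, y):
    """Mean-squared-error loss of the linear model X @ w, and its gradient in w."""
    err = X @ w - y
    return float(np.mean(err ** 2)), 2.0 * X.T @ err / len(y)

# Synthetic data: a benign retain task and an unrelated forget task.
d = 8
X_retain = rng.normal(size=(64, d))
y_retain = X_retain @ rng.normal(size=d)
X_forget = rng.normal(size=(16, d))
y_forget = X_forget @ rng.normal(size=d)  # the mapping we later want erased

# Phase 1: train jointly on both tasks (the model "learns the bad knowledge").
w = np.zeros(d)
for _ in range(500):
    _, g_r = loss_and_grad(w, X_retain, y_retain)
    _, g_f = loss_and_grad(w, X_forget, y_forget)
    w -= 0.01 * (g_r + g_f)

forget_before, _ = loss_and_grad(w, X_forget, y_forget)

# Phase 2: unlearn by ascending the forget loss while still descending the
# retain loss; the small ascent coefficient keeps the update stable.
for _ in range(300):
    _, g_r = loss_and_grad(w, X_retain, y_retain)
    _, g_f = loss_and_grad(w, X_forget, y_forget)
    w -= 0.01 * (g_r - 0.1 * g_f)  # sign flip on the forget-set gradient

retain_after, _ = loss_and_grad(w, X_retain, y_retain)
forget_after, _ = loss_and_grad(w, X_forget, y_forget)
print(f"forget loss: {forget_before:.3f} -> {forget_after:.3f}; "
      f"retain loss after unlearning: {retain_after:.3f}")
```

The forget loss rises while the retain loss stays low. Much of the literature below is about why this simple recipe is insufficient for LLMs: the "erased" knowledge is often recoverable by fine-tuning or tampering attacks, motivating the tamper-resistance, distillation, and localization work listed in the outputs.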
Target Case: Pessimistic
Some names: Rowan Wang, Avery Griffin, Johannes Treutlein, Zico Kolter, Bruce W. Lee, Addie Foote, Alex Infanger, Zesheng Shi, Yucheng Zhou, Jing Li, Timothy Qian, Stephen Casper, Alex Cloud
Estimated FTEs: 10-50
Outputs:
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics, by Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C Lipton, J Zico Kolter, Pratyush Maini
Modifying LLM Beliefs with Synthetic Document Finetuning, by Rowan Wang, Avery Griffin, Johannes Treutlein, Ethan Perez, Julian Michael, Fabien Roger, Sam Marks
From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization, by Shoaib Ahmed Siddiqui, Adrian Weller, David Krueger, Gintare Karolina Dziugaite, Michael Curtis Mozer, Eleni Triantafillou
Mirror Mirror on the Wall, Have I Forgotten it All? A New Framework for Evaluating Machine Unlearning, by Brennon Brimhall, Philip Mathew, Neil Fendley, Yinzhi Cao, Matthew Green
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research, by A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Kevin Klyman, Matthew Jagielski, Katja Filippova, Ken Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Eleni Triantafillou, Peter Kairouz, Nicole Elyse Mitchell, Niloofar Mireshghallah, Abigail Z. Jacobs, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Ilia Shumailov, Andreas Terzis, Solon Barocas, Jennifer Wortman Vaughan, Danah Boyd, Yejin Choi, Sanmi Koyejo, Fernando Delgado, Percy Liang, Daniel E. Ho, Pamela Samuelson, Miles Brundage, David Bau, Seth Neel, Hanna Wallach, Amy B. Cyphert, Mark A. Lemley, Nicolas Papernot, Katherine Lee
Open Problems in Machine Unlearning for AI Safety, by Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, Yarin Gal
Safety Alignment via Constrained Knowledge Unlearning, by Zesheng Shi, Yucheng Zhou, Jing Li
Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization, by Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs, by Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, Haibo Hu
Unlearning Needs to be More Selective [Progress Report], by Filip Sondej, Yushi Yang, Marcel Windys
Layered Unlearning for Adversarial Relearning, by Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell
Understanding Memorization via Loss Curvature, by Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities, by Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks, by Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, Alexander Matt Turner
Distillation Robustifies Unlearning, by Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, Alexander Matt Turner
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs, by Igor Shilov, Alex Cloud, Aryo Pradipta Gema, Jacob Goldman-Wetzler, Nina Panickssery, Henry Sleight, Erik Jones, Cem Anil