Capability removal: unlearning
Developing methods to selectively remove specific information, capabilities, or behaviors from a trained model without retraining it from scratch. A mixture of black-box and white-box approaches.
Theory of Change: If an AI learns dangerous knowledge (e.g., dual-use capabilities like virology or hacking, or knowledge of its own safety controls) or exhibits undesirable behaviors (e.g., memorizing private data), we can erase that specific "bad" knowledge post-training, which is much cheaper and faster than retraining, thereby making the model safer (a minimal sketch of the standard baseline appears below). Alternatively, intervene during pre-training to prevent the model from learning it in the first place (even when data filtering is imperfect). One could also imagine unlearning propensities toward power-seeking, deception, sycophancy, or spite.
Target Case: Pessimistic
Some names: Rowan Wang, Avery Griffin, Johannes Treutlein, Zico Kolter, Bruce W. Lee, Addie Foote, Alex Infanger, Zesheng Shi, Yucheng Zhou, Jing Li, Timothy Qian, Stephen Casper, Alex Cloud
Estimated FTEs: 10-50
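The standard baseline across much of this literature pairs gradient ascent on a "forget" set with ordinary training on a "retain" set. Below is a minimal sketch, assuming a Hugging Face-style causal LM whose forward pass returns a `.loss` and batches containing only `input_ids`/`attention_mask`; the names `forget_batch`, `retain_batch`, and the `alpha` weighting are illustrative, not any specific paper's method.

```python
# Minimal sketch of baseline unlearning: gradient ascent on the forget set,
# regularized by ordinary training on a retain set. Assumes a Hugging
# Face-style causal LM whose forward pass returns a .loss; the names
# forget_batch, retain_batch, and alpha are illustrative assumptions.
import torch

def unlearn_step(model, forget_batch, retain_batch, optimizer, alpha=1.0):
    optimizer.zero_grad()
    # Ascend on the forget set: increase loss on the knowledge to erase.
    forget_loss = model(**forget_batch, labels=forget_batch["input_ids"]).loss
    # Descend on the retain set so general capabilities survive.
    retain_loss = model(**retain_batch, labels=retain_batch["input_ids"]).loss
    (-forget_loss + alpha * retain_loss).backward()
    optimizer.step()
```

Much of the work listed under Outputs exists because this baseline is fragile: naive ascent can disrupt the whole model or leave knowledge recoverable by fine-tuning, motivating the tamper-resistant, meta-learned, and distillation-based variants below.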
Outputs:
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics — Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C Lipton, J Zico Kolter, Pratyush Maini
Modifying LLM Beliefs with Synthetic Document Finetuning — Rowan Wang, Avery Griffin, Johannes Treutlein, Ethan Perez, Julian Michael, Fabien Roger, Sam Marks
From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization — Shoaib Ahmed Siddiqui, Adrian Weller, David Krueger, Gintare Karolina Dziugaite, Michael Curtis Mozer, Eleni Triantafillou
Mirror Mirror on the Wall, Have I Forgotten it All? A New Framework for Evaluating Machine Unlearning — Brennon Brimhall, Philip Mathew, Neil Fendley, Yinzhi Cao, Matthew Green
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research — A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Kevin Klyman, Matthew Jagielski, Katja Filippova, Ken Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Eleni Triantafillou, Peter Kairouz, Nicole Elyse Mitchell, Niloofar Mireshghallah, Abigail Z. Jacobs, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Ilia Shumailov, Andreas Terzis, Solon Barocas, Jennifer Wortman Vaughan, Danah Boyd, Yejin Choi, Sanmi Koyejo, Fernando Delgado, Percy Liang, Daniel E. Ho, Pamela Samuelson, Miles Brundage, David Bau, Seth Neel, Hanna Wallach, Amy B. Cyphert, Mark A. Lemley, Nicolas Papernot, Katherine Lee
Open Problems in Machine Unlearning for AI Safety — Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, Yarin Gal
Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning — Filip Sondej, Yushi Yang
Safety Alignment via Constrained Knowledge Unlearning — Zesheng Shi, Yucheng Zhou, Jing Li
Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization — Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs — Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, Haibo Hu
Unlearning Needs to be More Selective [Progress Report] — Filip Sondej, Yushi Yang, Marcel Windys
Layered Unlearning for Adversarial Relearning — Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell
Understanding Memorization via Loss Curvature — Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities — Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks — Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, Alexander Matt Turner (see the sketch after this list)
Selective modularity: a research agenda — Alex Cloud, Jacob Goldman-Wetzler
Distillation Robustifies Unlearning — Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, Alexander Matt Turner
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs — Igor Shilov, Alex Cloud, Aryo Pradipta Gema, Jacob Goldman-Wetzler, Nina Panickssery, Henry Sleight, Erik Jones, Cem Anil
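As a companion to the Gradient Routing entry above: the core idea is to mask gradients during training so that designated data can only update a designated parameter region, localizing the capability so it can later be deleted. The sketch below is a coarse, parameter-level simplification under assumed names (`routed_params`, `is_forget`); the actual paper routes gradients at finer granularity.

```python
# Parameter-level simplification of gradient routing: gradients from
# "forget"-distribution batches are confined to a designated subnetwork
# (routed_params), so the capability can later be removed by ablating it.
# All names here (routed_params, is_forget) are illustrative assumptions.
import torch

def routed_step(model, batch, is_forget, routed_params, optimizer):
    optimizer.zero_grad()
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    if is_forget:
        routed = {id(p) for p in routed_params}
        for p in model.parameters():
            # Drop gradients outside the designated region, so whatever the
            # model learns from this batch lives only in routed_params.
            if id(p) not in routed:
                p.grad = None
    optimizer.step()

def ablate(routed_params):
    # After training, delete the localized capability by zeroing the region.
    with torch.no_grad():
        for p in routed_params:
            p.zero_()
```

Ablating the routed region afterwards is what turns localization into removal; the "Beyond Data Filtering" entry above pursues the same localize-then-remove strategy.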