Harm reduction for open weights
Develops methods, primarily based on pretraining data intervention, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to remove safety features or exploit dangerous capabilities.
Theory of Change:Open-weight models allow adversaries to easily remove post-training safety (like refusal training) via simple fine-tuning; by making safety an intrinsic property of the model's learned knowledge and capabilities (e.g., by ensuring "deep ignorance" of dual-use information), the safeguards become far more difficult and expensive to remove.
General Approach:Engineering
Target Case:Average Case
Orthodox Problems:
Some names:Kyle O'Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Rishub Tamirisa, Mantas Mazeika, Stella Biderman, Yarin Gal
Estimated FTEs:10-100
Outputs:
Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs— Kyle O'Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman
Tamper-Resistant Safeguards for Open-Weight LLMs— Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
Open Technical Problems in Open-Weight AI Model Risk Management— Stephen Casper, Kyle O'Brien, Shayne Longpre, Elizabeth Seger, Kevin Klyman, Rishi Bommasani, Aniruddha Nrusimha, Ilia Shumailov, Sören Mindermann, Steven Basart, Frank Rudzicz, Kellin Pelrine, Avijit Ghosh, Andrew Strait, Robert Kirk, Dan Hendrycks, Peter Henderson, J. Zico Kolter, Geoffrey Irving, Yarin Gal, Yoshua Bengio, Dylan Hadfield-Menell
A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety— Camille François, Ludovic Péran, Ayah Bdeir, Nouha Dziri, Will Hawkins, Yacine Jernite, Sayash Kapoor, Juliet Shen, Heidy Khlaaf, Kevin Klyman, Nik Marda, Marie Pellat, Deb Raji, Divya Siddarth, Aviya Skowron, Joseph Spisak, Madhulika Srikumar, Victor Storchan, Audrey Tang, Jen Weedon