Shallow Review of Technical AI Safety, 2025

Data attribution

Quantifies the influence of individual training data points on a model's specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back to their source in the training set.
Theory of Change: By attributing harmful, biased, or unaligned behaviors to specific training examples, researchers can audit proprietary models, debug training data, and enable effective data deletion/unlearning.
General Approach: Behavioral
Target Case: Average Case
Some names: Philipp Alexander Kreer, Jin Hwa Lee, Matthew Smith, Abhilasha Ravichander, Andrew Wang, Jiacheng Liu, Jiaqi Ma, Junwei Deng, Yijun Pan, Jesse Hoogland
Estimated FTEs: 30-60
Outputs:
Influence Dynamics and Stagewise Data Attribution (Jin Hwa Lee, Matthew Smith, Maxwell Adam, Jesse Hoogland)
What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions (Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, Jeff Schneider, Eduard Hovy, Roger Grosse, Eric Xing)
Better Training Data Attribution via Better Inverse Hessian-Vector Products (Andrew Wang, Elisa Nguyen, Runshi Yang, Juhan Bae, Sheila A. McIlraith, Roger Grosse)
DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models (Cathy Jiao, Yijun Pan, Emily Xiao, Daisy Sheng, Niket Jain, Hanzhang Zhao, Ishita Dasgupta, Jiaqi W. Ma, Chenyan Xiong)
Bayesian Influence Functions for Hessian-Free Data Attribution (Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, Jesse Hoogland)
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens (Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge)
You Are What You Eat -- AI Alignment Requires Understanding How Data Shapes Structure and Generalisation (Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, George Wang, Liam Carroll, Daniel Murfet)
Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models (Abhilasha Ravichander, Jillian Fisher, Taylor Sorensen, Ximing Lu, Yuchen Lin, Maria Antoniak, Niloofar Mireshghallah, Chandra Bhagavatula, Yejin Choi)
Distributional Training Data Attribution: What do Influence Functions Sample? (Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu, Richard E. Turner, Roger Grosse)
A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning (Yuzheng Hu, Fan Wu, Haotian Ye, David Forsyth, James Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao)