Shallow Review of Technical AI Safety, 2025

Data poisoning defense

Develops methods to detect and filter malicious or backdoor-inducing samples before they are included in training data.
Theory of Change: By identifying and filtering out malicious training examples, we can prevent attackers from planting hidden backdoors or triggers that would cause otherwise aligned models to behave dangerously. (A minimal filtering sketch follows the outputs below.)
General Approach: Engineering
Target Case: Pessimistic
Some names: Alexandra Souly, Javier Rando, Ed Chapman, Hanna Foerster, Ilia Shumailov, Yiren Zhao
Estimated FTEs: 5-20
Outputs:
LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations (Daniela Gottesman, Alon Gilae-Dotan, Ido Cohen, Yoav Gur-Arieh, Marius Mosbach, Ori Yoran, Mor Geva)
GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning (Weishuo Ma, Yanbo Wang, Xiyuan Wang, Lei Zou, Muhan Zhang)
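
To make the "detect and filter" idea concrete, here is a minimal sketch of one well-known defense in this family, spectral signatures (Tran et al., 2018): score each training example by its projection onto the top singular direction of centered feature representations and drop the highest-scoring outliers. This is an illustration only, not a method from the listed outputs; the feature matrix here is synthetic stand-in data, and the assumed poison fraction is a free parameter a practitioner would have to estimate.

```python
# Sketch of a spectral-signature-style poisoning filter. Features would normally be
# hidden representations (e.g. penultimate-layer activations) of training samples
# under a reference model; here they are synthetic placeholders.
import numpy as np

def spectral_filter(features: np.ndarray, expected_poison_frac: float = 0.05) -> np.ndarray:
    """Return indices of samples to keep after removing likely-poisoned outliers."""
    centered = features - features.mean(axis=0, keepdims=True)
    # The top right singular vector tends to capture the direction along which a
    # poisoned subpopulation separates from clean data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = (centered @ vt[0]) ** 2                      # per-sample outlier score
    n_remove = int(1.5 * expected_poison_frac * len(features))  # remove with a margin
    suspect = np.argsort(scores)[-n_remove:]              # highest-scoring samples
    return np.setdiff1d(np.arange(len(features)), suspect)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.normal(size=(950, 64))
    poisoned = rng.normal(loc=3.0, size=(50, 64))   # shifted cluster stands in for a backdoor
    feats = np.vstack([clean, poisoned])
    kept = spectral_filter(feats, expected_poison_frac=0.05)
    print(f"kept {len(kept)} of {len(feats)} samples; poisoned remaining: {(kept >= 950).sum()}")
```

The design choice worth noting is that the filter never needs to know the trigger: it only assumes poisoned samples form a small, unusually correlated subpopulation in representation space, which is why the removal budget is tied to an assumed poison fraction rather than to any property of the attack itself.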