Data poisoning defense
Develops methods to detect malicious or backdoor-inducing samples and prevent them from being included in training data.
Theory of Change: By identifying and filtering out malicious training examples, we can prevent attackers from planting hidden backdoors or triggers that would cause otherwise aligned models to behave dangerously (a minimal filtering sketch follows this entry).
General Approach: Engineering
Target Case: Pessimistic
See Also:
Data filtering, Safeguards (inference-time auxiliaries), Various Redteams, Adversarial robustness
Some names: Alexandra Souly, Javier Rando, Ed Chapman, Hanna Foerster, Ilia Shumailov, Yiren Zhao
Estimated FTEs: 5-20
Outputs:
LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations — Daniela Gottesman, Alon Gilae-Dotan, Ido Cohen, Yoav Gur-Arieh, Marius Mosbach, Ori Yoran, Mor Geva
GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning — Weishuo Ma, Yanbo Wang, Xiyuan Wang, Lei Zou, Muhan Zhang
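Many defenses in this line score training examples for anomalies in a model's representation space and drop the most suspicious ones before (re)training. The sketch below is a minimal, generic illustration of that idea, loosely in the spirit of spectral-signature-style filtering; it is not the method of any group named above. The `filter_suspected_poison` helper, the frozen-encoder features it assumes as input, and the quantile cutoff are all illustrative assumptions.

```python
# Minimal sketch: flag training examples whose features project unusually
# far onto the top singular direction of the centered feature matrix,
# a common signature of backdoored subsets. Illustrative only.

import numpy as np

def filter_suspected_poison(features: np.ndarray, quantile: float = 0.99):
    """Return (keep_idx, flagged_idx) given per-example feature vectors.

    features : (n_examples, dim) array, e.g. from a frozen encoder (assumed).
    quantile : fraction of examples to keep; the rest are flagged.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Top right-singular vector of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = np.abs(centered @ vt[0])        # outlier score per example
    cutoff = np.quantile(scores, quantile)
    flagged = np.where(scores > cutoff)[0]
    keep = np.where(scores <= cutoff)[0]
    return keep, flagged

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.normal(size=(980, 64))
    # Synthetic "poisoned" cluster, shifted along one feature direction.
    poisoned = rng.normal(size=(20, 64)) + 4.0 * np.eye(64)[0]
    feats = np.vstack([clean, poisoned])
    keep, flagged = filter_suspected_poison(feats, quantile=0.98)
    print(f"kept {len(keep)}, flagged {len(flagged)} as suspicious")
    print("flagged examples that are actually poisoned:", int(np.sum(flagged >= 980)))
```

Real defenses differ in where the features come from (per-class activations, gradients, influence scores) and in how the threshold is chosen, but the filter-before-training structure shown here is the common pattern this agenda targets.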