Shallow Review of Technical AI Safety, 2025

Data poisoning defense

Develops methods to detect and filter malicious or backdoor-inducing samples before they are included in training data.
Theory of Change: By identifying and filtering out malicious training examples, we can prevent attackers from planting hidden backdoors or triggers that would cause otherwise aligned models to behave dangerously. (A minimal filtering sketch follows the outputs below.)
General Approach: Engineering
Target Case: Pessimistic
Some names: Alexandra Souly, Javier Rando, Ed Chapman, Hanna Foerster, Ilia Shumailov, Yiren Zhao
Estimated FTEs: 5-20
Outputs:
LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations (Daniela Gottesman, Alon Gilae-Dotan, Ido Cohen, Yoav Gur-Arieh, Marius Mosbach, Ori Yoran, Mor Geva)
GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning (Weishuo Ma, Yanbo Wang, Xiyuan Wang, Lei Zou, Muhan Zhang)
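
To make the "detect and filter" idea concrete, here is a minimal sketch of one well-known defense in this family, spectral signatures (Tran et al., 2018): score each training example by its projection onto the top singular direction of centered feature representations and drop the highest-scoring outliers. This is an illustration only, not a method from the listed outputs; the feature matrix here is synthetic stand-in data, and the assumed poison fraction is a free parameter a practitioner would have to estimate.

```python
# Sketch of a spectral-signature-style poisoning filter. Features would normally be
# hidden representations (e.g. penultimate-layer activations) of training samples
# under a reference model; here they are synthetic placeholders.
import numpy as np

def spectral_filter(features: np.ndarray, expected_poison_frac: float = 0.05) -> np.ndarray:
    """Return indices of samples to keep after removing likely-poisoned outliers."""
    centered = features - features.mean(axis=0, keepdims=True)
    # The top right singular vector tends to capture the direction along which a
    # poisoned subpopulation separates from clean data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = (centered @ vt[0]) ** 2                      # per-sample outlier score
    n_remove = int(1.5 * expected_poison_frac * len(features))  # remove with a margin
    suspect = np.argsort(scores)[-n_remove:]              # highest-scoring samples
    return np.setdiff1d(np.arange(len(features)), suspect)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.normal(size=(950, 64))
    poisoned = rng.normal(loc=3.0, size=(50, 64))   # shifted cluster stands in for a backdoor
    feats = np.vstack([clean, poisoned])
    kept = spectral_filter(feats, expected_poison_frac=0.05)
    print(f"kept {len(kept)} of {len(feats)} samples; poisoned remaining: {(kept >= 950).sum()}")
```

The design choice worth noting is that the filter never needs to know the trigger: it only assumes poisoned samples form a small, unusually correlated subpopulation in representation space, which is why the removal budget is tied to an assumed poison fraction rather than to any property of the attack itself.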