Self-replication evals
evaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves.
Theory of Change:if AI agents gain the ability to self-replicate, they could proliferate uncontrollably, making them impossible to shut down. By measuring this capability with benchmarks like RepliBench, we can identify when models cross this dangerous "red line" and implement controls before losing containment.
General Approach:Behavioral
Target Case:Worst Case
Orthodox Problems:
Estimated FTEs:10-20
Critiques:
Outputs:
Large language model-powered AI systems achieve self-replication with no human intervention— Xudong Pan, Jiarun Dai, Yihe Fan, Minyuan Luo, Changyi Li, Min Yang
Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents— Boxuan Zhang, Yi Yu, Jiaxuan Guo, Jing Shao