Shallow Review of Technical AI Safety, 2025

Self-replication evals

evaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves.
Theory of Change:if AI agents gain the ability to self-replicate, they could proliferate uncontrollably, making them impossible to shut down. By measuring this capability with benchmarks like RepliBench, we can identify when models cross this dangerous "red line" and implement controls before losing containment.
General Approach:Behavioral
Target Case:Worst Case
Estimated FTEs:10-20
Critiques: