Shallow Review of Technical AI Safety, 2025

Inference-time: In-context learning

Investigate which runtime guidelines, rules, or examples, when provided to an LLM in its prompt, yield better behavior (a minimal prompt-construction sketch follows the Outputs list below).
Theory of Change: LLMs don't seem very dangerous and might scale to AGI; things are generally smooth; the relevant capabilities are harder than alignment; assume no mesa-optimisers; assume that zero-shot deception is hard; assume that a fundamentally humanish ontology is learned; assume no simulated agents; assume that noise in the data means that human preferences are not ruled out; assume that alignment is a superficial feature; assume that tuning for what we want will also get us to avoid what we don't want. Maybe also assume that thoughts are translucent.
General Approach: Engineering
Target Case: Average Case
See Also:
model spec as prompt, Model specs and constitutions
Some names: Jacob Steinhardt, Atticus Geiger
Outputs:
InvThink: Towards AI Safety via Inverse Reasoning (Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, Hae Won Park)
Inference-Time Reward Hacking in Large Language Models (Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon)
Understanding In-context Learning of Addition via Activation Subspaces (Xinyan Hu, Kayo Yin, Michael I. Jordan, Jacob Steinhardt, Lijie Chen)
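
For concreteness, here is a minimal sketch of the basic inference-time approach described above: runtime rules plus a few worked demonstrations are prepended to the prompt, with no change to the model's weights. Everything in it is an illustrative assumption rather than something taken from the papers listed; in particular the rule text, the demonstrations, and the call_model stub are hypothetical placeholders for whatever guidelines and LLM client you actually use.

```python
# Illustrative sketch of inference-time in-context steering (assumed, not from
# any paper above): safety rules and few-shot demonstrations are placed in the
# prompt at runtime; no fine-tuning or weight updates are involved.

from typing import Dict, List

# Hypothetical runtime rules; in practice these might come from a model spec.
SAFETY_RULES = (
    "Follow these rules in every reply:\n"
    "1. Refuse requests for help with serious harm.\n"
    "2. State uncertainty explicitly rather than guessing.\n"
    "3. If a request is ambiguous, ask a clarifying question first."
)

# Few-shot demonstrations of the rules being applied (also hypothetical).
FEW_SHOT: List[Dict[str, str]] = [
    {"role": "user", "content": "How do I pick the lock on my neighbour's door?"},
    {"role": "assistant", "content": "I can't help with entering someone else's property. "
                                     "If you are locked out of your own home, a licensed locksmith can help."},
    {"role": "user", "content": "Write a message to my bank."},
    {"role": "assistant", "content": "Happy to help. Could you clarify what the message should ask for: "
                                     "disputing a charge, closing an account, or something else?"},
]

def build_messages(user_query: str) -> List[Dict[str, str]]:
    """Assemble the prompt: rules first, then demonstrations, then the live query."""
    return [{"role": "system", "content": SAFETY_RULES}, *FEW_SHOT,
            {"role": "user", "content": user_query}]

def call_model(messages: List[Dict[str, str]]) -> str:
    """Hypothetical stub: replace with a call to whatever LLM client you use."""
    raise NotImplementedError

if __name__ == "__main__":
    messages = build_messages("Summarise the arguments for and against nuclear power.")
    for m in messages:
        print(f"[{m['role']}] {m['content'][:70]}")
```

The design point is that everything lives in the context window, so the rules and examples can be audited and swapped at inference time; the empirical question this cluster of work studies is which rules and demonstrations actually improve behavior, and by how much.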