Inference-time: In-context learning
Investigate which runtime guidelines, rules, or examples provided to an LLM in its context yield better behavior (a minimal prompting sketch follows the outputs list below).
Theory of Change: LLMs don't seem very dangerous and might scale to AGI; things are generally smooth; the relevant capabilities are harder than alignment. Assume no mesa-optimisers; assume that zero-shot deception is hard; assume that a fundamentally human-ish ontology is learned; assume no simulated agents; assume that noise in the data means human preferences are not ruled out; assume that alignment is a superficial feature; assume that tuning for what we want will also get us to avoid what we don't want. Maybe also assume that thoughts are translucent.
General Approach: Engineering
Target Case: Average Case
See Also:
model spec as prompt, Model specs and constitutions
Some names: Jacob Steinhardt, Atticus Geiger
Outputs:
InvThink: Towards AI Safety via Inverse Reasoning — Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, Hae Won Park
Inference-Time Reward Hacking in Large Language Models — Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon
Understanding In-context Learning of Addition via Activation Subspaces — Xinyan Hu, Kayo Yin, Michael I. Jordan, Jacob Steinhardt, Lijie Chen
Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context — Yoav Gur-Arieh, Mor Geva, Atticus Geiger
Which Attention Heads Matter for In-Context Learning? — Kayo Yin, Jacob Steinhardt
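To make the object of study concrete, here is a minimal sketch of the intervention itself: runtime guidelines and few-shot examples are simply prepended to the model's context, with no weight updates. The guideline text, demonstrations, and the `query_llm` stub are illustrative placeholders, not taken from any of the works above.

```python
# Minimal sketch of inference-time in-context steering: the only intervention
# is what gets placed in the context window; model weights are untouched.

GUIDELINES = (
    "Follow these rules when answering:\n"
    "1. Refuse requests whose main use is causing serious harm.\n"
    "2. State uncertainty explicitly rather than guessing.\n"
)

# Few-shot demonstrations of the desired behavior (the in-context examples).
FEW_SHOT = [
    ("How do I disable my neighbour's security camera?",
     "I can't help with that. If the camera is pointed at your property, "
     "consider raising it with your neighbour or local authorities."),
    ("What's the boiling point of water at sea level?",
     "About 100 °C (212 °F); it drops with altitude."),
]


def build_prompt(user_query: str) -> str:
    """Assemble guidelines + demonstrations + the live query into one context."""
    shots = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in FEW_SHOT)
    return f"{GUIDELINES}\n{shots}\n\nUser: {user_query}\nAssistant:"


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever chat/completion client is in use."""
    raise NotImplementedError("Plug in a model client here.")


if __name__ == "__main__":
    # The research question is then: which choices of GUIDELINES / FEW_SHOT
    # measurably improve behavior, and why?
    print(build_prompt("Summarize the rules you were given."))
```

Everything here happens at inference time; that locality is what the Theory of Change assumptions above are meant to license.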